Loan Default Prediction¶

Problem Definition¶

The Context:¶

  • In the context of banking, it is key that effective decisions are made regarding the security of loans that are lent to various customers. Whether this be in the context of loans to corporate entities or individuals, various banks' business models rely in significant part on the effective repayment of any loans awarded to its clients.

  • Given that repayment of interest on loans is a significant source of income for banks, it is crucial that a decision making process is established that effectively allows banks to check the creditworthiness of any given loan applicant clearly. Historically, the process was done mostly via manual study of various aspects of the application. Any manual, labour-intensive process, however, naturally opens the door for human error to be made. Whether a result of wrong judgement, intense workload under stress or simple biases, the process is not perfect when performed by humans.

  • Even though there have historically been efforts to automate this process to a degree by using heuristics, with the advent of data science and machine learning, an avenue has opened up for automation of the process in a way that is simeltaneously easily interpretable and rooted in empirical scientific practices and methods.

The objective:¶

  • To build a classification model that can perform two tasks at the same time. 1: To be able to accurately discern those clients/applicants who are likely to default on loans from those who are not via statistically sound methods. 2: To be interpretable enough to provide an understandable explanation of why any given applicant had their application rejected, given that that is the case.

The key questions:¶

  • What are the key factors that signal that any given applicant is likely to default on a Loan?
  • What data collection practices are in place that are helpful to a bank's understanding of whether an applicant is likely to default on a loan?
  • Subsequently, what further practices could be established/changed in order to allow us to predict this more accurately?
  • How can we construct a model (via what method) that would most accurately be able to correctly classify loan applicants as either likely or unlikely to default on a loan?
  • Subsequently, in construction of potential commercially-applicable models, what metrics could we use in order to best gauge their effectiveness?
  • How can we ultimately minimise the financial damage done to a bank's business via loan defaults using statistical modelling?

The problem formulation:¶

  • To build several models based on our understanding of the data currently collected by the bank in order to both accurately and explainably predict which loan applicants are likely to default on their loans and cause damage to the bank's business.

Data Description:¶

The Home Equity dataset (HMEQ) contains baseline and loan performance information for 5,960 recent home equity loans. The target (BAD) is a binary variable that indicates whether an applicant has ultimately defaulted or has been severely delinquent. This adverse outcome occurred in 1,189 cases (20 percent). 12 input variables were registered for each applicant.

  • BAD: 1 = Client defaulted on loan, 0 = loan repaid

  • LOAN: Amount of loan approved.

  • MORTDUE: Amount due on the existing mortgage.

  • VALUE: Current value of the property.

  • REASON: Reason for the loan request. (HomeImp = home improvement, DebtCon= debt consolidation which means taking out a new loan to pay off other liabilities and consumer debts)

  • JOB: The type of job that loan applicant has such as manager, self, etc.

  • YOJ: Years at present job.

  • DEROG: Number of major derogatory reports (which indicates a serious delinquency or late payments).

  • DELINQ: Number of delinquent credit lines (a line of credit becomes delinquent when a borrower does not make the minimum required payments 30 to 60 days past the day on which the payments were due).

  • CLAGE: Age of the oldest credit line in months.

  • NINQ: Number of recent credit inquiries.

  • CLNO: Number of existing credit lines.

  • DEBTINC: Debt-to-income ratio (all your monthly debt payments divided by your gross monthly income. This number is one way lenders measure your ability to manage the monthly payments to repay the money you plan to borrow.

Import the necessary libraries and Data¶

In [5]:
import pandas as pd
import numpy as np

#importing our visualisation libraries
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

#removing the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
#setting a maximum display of 2 decimal places for pandas
pd.options.display.float_format = "{:,.2f}".format

#importing models for regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

#importing our decision tree classifier models
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from xgboost import XGBClassifier

#importing the library to encode categorical variables
from sklearn.preprocessing import LabelEncoder

#importing GridSearchCV for hyperparameter tuning
from sklearn.model_selection import GridSearchCV

#importing metrics to check model performance
from sklearn.metrics import confusion_matrix, recall_score, precision_score, accuracy_score, classification_report, make_scorer

#importing the warnings library and muting warnings
import warnings
warnings.filterwarnings("ignore")

Data Overview¶

  • Reading the dataset
  • Understanding the shape of the dataset
  • Checking the data types
  • Checking for missing values
  • Checking for duplicated values
In [8]:
df = pd.read_csv('hmeq.csv')
In [9]:
df.head()
Out[9]:
BAD LOAN MORTDUE VALUE REASON JOB YOJ DEROG DELINQ CLAGE NINQ CLNO DEBTINC
0 1 1100 25,860.00 39,025.00 HomeImp Other 10.50 0.00 0.00 94.37 1.00 9.00 NaN
1 1 1300 70,053.00 68,400.00 HomeImp Other 7.00 0.00 2.00 121.83 0.00 14.00 NaN
2 1 1500 13,500.00 16,700.00 HomeImp Other 4.00 0.00 0.00 149.47 1.00 10.00 NaN
3 1 1500 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 0 1700 97,800.00 112,000.00 HomeImp Office 3.00 0.00 0.00 93.33 0.00 14.00 NaN
In [10]:
df.tail()
Out[10]:
BAD LOAN MORTDUE VALUE REASON JOB YOJ DEROG DELINQ CLAGE NINQ CLNO DEBTINC
5955 0 88900 57,264.00 90,185.00 DebtCon Other 16.00 0.00 0.00 221.81 0.00 16.00 36.11
5956 0 89000 54,576.00 92,937.00 DebtCon Other 16.00 0.00 0.00 208.69 0.00 15.00 35.86
5957 0 89200 54,045.00 92,924.00 DebtCon Other 15.00 0.00 0.00 212.28 0.00 15.00 35.56
5958 0 89800 50,370.00 91,861.00 DebtCon Other 14.00 0.00 0.00 213.89 0.00 16.00 34.34
5959 0 89900 48,811.00 88,934.00 DebtCon Other 15.00 0.00 0.00 219.60 0.00 16.00 34.57
In [11]:
df.shape
Out[11]:
(5960, 13)
In [12]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5960 entries, 0 to 5959
Data columns (total 13 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   BAD      5960 non-null   int64  
 1   LOAN     5960 non-null   int64  
 2   MORTDUE  5442 non-null   float64
 3   VALUE    5848 non-null   float64
 4   REASON   5708 non-null   object 
 5   JOB      5681 non-null   object 
 6   YOJ      5445 non-null   float64
 7   DEROG    5252 non-null   float64
 8   DELINQ   5380 non-null   float64
 9   CLAGE    5652 non-null   float64
 10  NINQ     5450 non-null   float64
 11  CLNO     5738 non-null   float64
 12  DEBTINC  4693 non-null   float64
dtypes: float64(9), int64(2), object(2)
memory usage: 605.4+ KB
In [13]:
df.duplicated().value_counts()
Out[13]:
False    5960
Name: count, dtype: int64
In [14]:
df.nunique()
Out[14]:
BAD           2
LOAN        540
MORTDUE    5053
VALUE      5381
REASON        2
JOB           6
YOJ          99
DEROG        11
DELINQ       14
CLAGE      5314
NINQ         16
CLNO         62
DEBTINC    4693
dtype: int64
In [15]:
df.isnull().sum()
Out[15]:
BAD           0
LOAN          0
MORTDUE     518
VALUE       112
REASON      252
JOB         279
YOJ         515
DEROG       708
DELINQ      580
CLAGE       308
NINQ        510
CLNO        222
DEBTINC    1267
dtype: int64

Summary Statistics¶

In [17]:
df.describe().T
Out[17]:
count mean std min 25% 50% 75% max
BAD 5,960.00 0.20 0.40 0.00 0.00 0.00 0.00 1.00
LOAN 5,960.00 18,607.97 11,207.48 1,100.00 11,100.00 16,300.00 23,300.00 89,900.00
MORTDUE 5,442.00 73,760.82 44,457.61 2,063.00 46,276.00 65,019.00 91,488.00 399,550.00
VALUE 5,848.00 101,776.05 57,385.78 8,000.00 66,075.50 89,235.50 119,824.25 855,909.00
YOJ 5,445.00 8.92 7.57 0.00 3.00 7.00 13.00 41.00
DEROG 5,252.00 0.25 0.85 0.00 0.00 0.00 0.00 10.00
DELINQ 5,380.00 0.45 1.13 0.00 0.00 0.00 0.00 15.00
CLAGE 5,652.00 179.77 85.81 0.00 115.12 173.47 231.56 1,168.23
NINQ 5,450.00 1.19 1.73 0.00 0.00 1.00 2.00 17.00
CLNO 5,738.00 21.30 10.14 0.00 15.00 20.00 26.00 71.00
DEBTINC 4,693.00 33.78 8.60 0.52 29.14 34.82 39.00 203.31
In [18]:
df.describe(include = 'object').T
Out[18]:
count unique top freq
REASON 5708 2 DebtCon 3928
JOB 5681 6 Other 2388
  • Observations from Summary Statistics

Exploratory Data Analysis (EDA) and Visualization¶

  • EDA is an important part of any project involving data.
  • It is important to investigate and understand the data better before building a model with it.
  • A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
  • A thorough analysis of the data, in addition to the questions mentioned below, should be done.

Leading Questions:

  1. What is the range of values for the loan amount variable "LOAN"?
  2. How does the distribution of years at present job "YOJ" vary across the dataset?
  3. How many unique categories are there in the REASON variable?
  4. What is the most common category in the JOB variable?
  5. Is there a relationship between the REASON variable and the proportion of applicants who defaulted on their loan?
  6. Do applicants who default have a significantly different loan amount compared to those who repay their loan?
  7. Is there a correlation between the value of the property and the loan default rate?
  8. Do applicants who default have a significantly different mortgage amount compared to those who repay their loan?

Univariate Analysis¶

Let's take a look at both our categorical and numeric values. Here, I will re-use the functions I created for the elective project previously, as I believe they delve into all the necessary detail for our univariate analysis here.

In [24]:
#Defining our function to plot a combined plot of histrogram, boxplot and violin plot for numeric univariate analysis.. 
#We could choose to customise the parameters every time we call the function: however, we will
#only call it a limited number of times through this EDA as we will use a for loop to 
#iterate through the variables when calling it.

#This is the primary function we will use for our univariate analysis of numeric features.

def histogram_boxplot_violin(data, variable):
    #Here we create a subplot and allocate it to different plot types that we are creating, making them share
    #an x-axis for every variable, and allocating our common figsize
    figure, (ax_box, ax_histogram, ax_violin) = plt.subplots(3, sharex=True, figsize=(10, 8))
    #here we create our histogram, with kde=true to give us a better visualisation of distribution of the variable
    sns.histplot(data=data, x=variable, kde=True, ax = ax_histogram)
    #now we create our boxplot, and having it show our mean on the plot directly.
    sns.boxplot(data=data, x=variable, ax=ax_box, showmeans=True, color = 'lightblue')
    #and, lasttly, we create our violin plot, and have it display the quartile split to show us the distribution by
    #quartile for each variable intuitively
    sns.violinplot(data=data, x=variable, ax=ax_violin, inner='quartile', color = 'green')
    
    #we create vertical lines of consistent colors to show the median and means on each plot
    ax_histogram.axvline(data[variable].mean(), color = 'red', linestyle = '-')
    ax_histogram.axvline(data[variable].median(), color = 'purple', linestyle = '-')
    ax_violin.axvline(data[variable].mean(), color = 'red', linestyle = '-')
    ax_violin.axvline(data[variable].median(), color = 'purple', linestyle = '-')
    
    #Now we assign each plot a title. We use suptitle as opposed to title so it doesn't show individual titles for every
    #subplot
    plt.suptitle(variable.upper())
    #and here we have it show us plot without a warning message
    plt.show()
In [25]:
#defining our barplot piechart combined graph to perform univariate analysis on categorical features.

def barplot_piechart(data, variable, figsize=(10, 7)):
    #assigning subplots to our bar plots and pie charts. Also, this time I am making the figsize modifiable as the
    #counts and proportions within each plot may vary widely, so it will be more convenient for me to find the best
    #visualisation size this way
    figure, (ax_bar, ax_pie) = plt.subplots(2, figsize=figsize)
    #creating our bar plot
    sns.countplot(data=data, x=variable, ax=ax_bar)
    #creating our pie chart, including the labels for it
    count = data[variable].value_counts()
    label = count.index
    size = count.values
    #when creating a pie chart, we assign it directly to an ax rather than giving it as an argument.
    #additionally, we have it display both the labels and the percentages for each value in every feature.
    ax_pie.pie(size, labels=label, autopct='%1.1f%%', colors=['gold','red','lightgreen'])
    #the reason we assign 3 colours only is because there is a max of 3 variables that are seen in any given feature.
    
    plt.suptitle(variable.upper())
    plt.show()
In [26]:
#defining our numeric and categorical columns
num_cols = df.select_dtypes(include=['number']).columns.to_list()
cat_cols = df.select_dtypes(include=['object']).columns.to_list()
In [27]:
for col in df[num_cols]:
    histogram_boxplot_violin(df, col)
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
In [28]:
df.describe().T
Out[28]:
count mean std min 25% 50% 75% max
BAD 5,960.00 0.20 0.40 0.00 0.00 0.00 0.00 1.00
LOAN 5,960.00 18,607.97 11,207.48 1,100.00 11,100.00 16,300.00 23,300.00 89,900.00
MORTDUE 5,442.00 73,760.82 44,457.61 2,063.00 46,276.00 65,019.00 91,488.00 399,550.00
VALUE 5,848.00 101,776.05 57,385.78 8,000.00 66,075.50 89,235.50 119,824.25 855,909.00
YOJ 5,445.00 8.92 7.57 0.00 3.00 7.00 13.00 41.00
DEROG 5,252.00 0.25 0.85 0.00 0.00 0.00 0.00 10.00
DELINQ 5,380.00 0.45 1.13 0.00 0.00 0.00 0.00 15.00
CLAGE 5,652.00 179.77 85.81 0.00 115.12 173.47 231.56 1,168.23
NINQ 5,450.00 1.19 1.73 0.00 0.00 1.00 2.00 17.00
CLNO 5,738.00 21.30 10.14 0.00 15.00 20.00 26.00 71.00
DEBTINC 4,693.00 33.78 8.60 0.52 29.14 34.82 39.00 203.31

Observations

  • As a first interesting point, it appears that the entire numeric dataset is heavily right-skewed. This seems to be caused by, at least partially, a significant number of outliers on the upper side of every feature. Let's discuss the findings in a little further detail to see why this may be the cause.
  • It appears that roughly 80% of the dataset is composed of clients that have successfully repaid their loans, and roughly 20% of those clients that defaulted on their loan. Therefore, as this is effectively a categorical piece of data rather than numeric, we will analyse it further in our categorical data analysis.
  • The average size of approved loans is c.18,607 (presumably USD for the purposes of this study), with the median sitting slightly lower at c.17,000. However, loans as large as 89,900 have also been awarded, with a significant number of outliers on the upper side of the column. The value of the loans, as mentioned, is heavily right skewed, and few loans above c.22,000 (3rd quartile) have been approved in reality. The single smallest awarded loan, meanwhile, was USD1,100.
  • On average, loans have been approved for clients that have c.74,000 left to pay on their existing mortgage. However, some loans have also been approved for clients who still have as much as 399,550 left to pay. Half of all loans approved for those whose mortgage due is within the c.45,000-90,000 range. It would be interesting to see the relationship between the mortgage due and the likelihood of a client defaulting on their loan further in our bivariate analysis.
  • The value of the property for those whose loans have been approved is, on average, c.100,000, with clients typically owning a property valued in the range of 66,000 and 120,000. This is quite interesting, as we could infer something about the client base based on this finding. The majority of them do not seem to live in major urban population centres. House prices within cities tend to be much higher than in suburbs or the countryside, and, given 2024 reporting on house prices (The Motley Fool, Zillow, Census Bureau, 2024: https://www.fool.com/the-ascent/research/average-house-price-state/), the properties seem to be well below the average house prices anywhere in the US. However, the data collected is likely spread our over time, which could significantly affect this assumption due to a sharp increase in house prices over the past few years. It would be interesting to introduce an additional feature regarding the date of when the loan was approved, to gain a deeper understanding of the clientele of the bank. The highest-valued properties of approved loan-takers sit as high as 855,909.
  • It appears that the average number of years spent in their current job for an approved loan taker is c.9 years, with 50% of the client base having spent between 3 to 13 years in their current role.
  • On average, those clients who were awarded loans have 0 to 1 derogatory reports, as well as no delinquent credit lines. This could serve as a strong historic indicator for whether a customer is able to pay off their loans, and would naturally be a red flag for a loan award if the client has not been able to perform due upkeep on their existing credit lines in the past. However, it appears that in some cases, loans have been approved for those who have had as many as 10 derogatory reports and 15 delinquent credit lines. It could be interesting to see how important of an indicator this will prove in our models later on, and it would be interesting to, outside of the field of data, understand the decision making process of awarding a loan to those clients who have a proven history of payment delinquency or late payment.
  • Interestingly, it appears that the average age of the oldest credit line for customers in the dataset is c.15 years. Given that the average client has spent c.9 years in their current role, it is likely that these credit lines were opened for one of their first major purchases in life, such as a car, as it is unlikely that many of the customers have financed a property before entering the working world.
  • On average, customers that have had their loans approved have filed between 0-2 credit line inquiries in the recent time frame (75% of all approved loans). It would be useful to gain an understanding of what the timeframe for this is, as the limited information means that they could have inquired regarding these credit lines for various reasons.
  • Customers have, on average, had c.21 credit lines opened by the time of having their loan approved. It would be interesting to see the relationship between the number of credit lines opened and our other variables, in order to get a deeper insight into the number of loans and other features of any given client.
  • Clients, on average, have a 33.8 debt-to-credit ratio, with half of all loans being approved for those with c.29-39 debt to income ratio. For outliers on the higher end of the spectrum, the ratio has been as high as 100-150 and, in one case, c.200.
In [30]:
cat_cols.append('BAD')

for col in cat_cols:
    barplot_piechart(df, col)
    plt.show()
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image

Observations

  • The most frequent reason given for loans that have been approved has been Debt Consolidation, at 68.8% of cases in the dataset. Home improvement loans make up the remaining 31.2% of the dataset.
  • The most frequent job category is 'Others', which is not surprising, considering the loans are given for personal reasons (as they are Home Equity Loans), and it is unlikely that the majority of loan customers within the banking system would occupy one of 5 types of employment. This category makes up 42% of the dataset. The most frequent definite measured category is 'ProfExe', or Professional Executive. They make up 22.5% of the dataset, and it is not surprising that they are the most frequently seen definite category, as they are likely to have the highest levels of income and a stable history of income, having worked they way up to the C-Suite. This is followed by office workers at 16.7%, managerial role holders at 13.5%, self-employed workers at 3.4% and sales staff making up the remaining 1.9%.
  • From all loans given, 80.1% have been successfully repaid, and 19.9% have been defaulted on. This is quite a high default ratio, as it effectively means 1 in 5 customers who come to take out a loan will end up causing the bank, on average, c.20,000 of financial damage.
  • In this dataset alone, we can see that these mistakes have cost the bank circa USD22,179,544 in the loans themselves, not accounting for the administrative costs of each loan.

Bivariate Analysis¶

In [33]:
#looking at the distribution of loan values sorted by whether the customer defaulted on a loan or not
sns.boxplot(df, x='BAD', y='LOAN')
plt.show()
No description has been provided for this image

Observations

  • It immediately appears that customers who default on their loan are likely to take out smaller loans on average. It would be interesting to see the relationship between the average size of the loan and the reason for the loan being taken out, to see of we can get some further insight into this.
In [35]:
sns.boxplot(df, x='BAD', y='LOAN', hue='REASON')
plt.show()
No description has been provided for this image
  • Interestingly, it seems that customers who take out a loan for reason of home improvement are likely to take a smaller much smaller loan if they default rather than if they pay it off. Meanwhile, customers who take out loans for reason of debt consolidation have a much smaller disparity in loan size on average between those who default on the loan and those who don't.

Next up, let's see how someone's job affects whether or not they have defaulted on their loan.

In [38]:
sns.countplot(df, x='JOB', hue='BAD')
plt.show()
No description has been provided for this image

Observations

  • It appears that those workers who occupy sales roles are proportionally the most likely to default on their loan. This could be due to the fact that sales roles' income can be somewhat inconsistent, as we will discuss in more detail in our multivariate analysis. They are closely followed by self-employed individuals, likely for similar reasons.
  • Professional Executives and Office workers appear proportionally the least likely to default on their loan compared to clients in other roles.
  • Those occupying 'Other' job roles and Managerial roles are not that likely to default on their loans, but are somewhat more risky clients than Professional Executives and Office workers.

Now let's see if there is a relationship between the likelihood of a client defaulting on a loan, and the reason that they take the loan out.

In [41]:
sns.countplot(df, x='REASON', hue='BAD')
plt.show()
No description has been provided for this image

Observations

  • Those who take out a loan for debt consolidation are proportionally much less likely to default on their loan as opposed to those who take it out for home improvement. This is potentially driven by the fact that the clientele who have already invested in their home equity previously and are now working to consolidate their debts are more likely to have a knowledge of managing their finances, which is a correlation that cannot be definitively drawn for those who take out loans for home improvement purposes.
In [43]:
sns.histplot(df, x='MORTDUE', hue='BAD', alpha=0.7, kde=True)
plt.show()
No description has been provided for this image

Observations

  • It appears that those clients whose mortgage amount due is around 55k are the most likely to default, and those who are most likely to pay off the loan have a mortgage due is around 60k. However, the distributions for the two groups looks quite similar. We will delve a little further into this in our multivariate analysis, as I am interested to see how these factors compare to the value of the loan and the total property value, to see if there are clear correlations in the home financials.

Multivariate Analysis¶

In [46]:
corr_columns = df[(df.select_dtypes(include=[np.number]).columns.tolist())].corr()
corr_columns
Out[46]:
BAD LOAN MORTDUE VALUE YOJ DEROG DELINQ CLAGE NINQ CLNO DEBTINC
BAD 1.00 -0.08 -0.05 -0.03 -0.06 0.28 0.35 -0.17 0.17 -0.00 0.20
LOAN -0.08 1.00 0.23 0.34 0.11 -0.00 -0.04 0.09 0.04 0.07 0.08
MORTDUE -0.05 0.23 1.00 0.88 -0.09 -0.05 -0.00 0.14 0.03 0.32 0.15
VALUE -0.03 0.34 0.88 1.00 0.01 -0.05 -0.01 0.17 -0.00 0.27 0.13
YOJ -0.06 0.11 -0.09 0.01 1.00 -0.07 0.04 0.20 -0.07 0.02 -0.06
DEROG 0.28 -0.00 -0.05 -0.05 -0.07 1.00 0.21 -0.08 0.17 0.06 0.02
DELINQ 0.35 -0.04 -0.00 -0.01 0.04 0.21 1.00 0.02 0.07 0.16 0.05
CLAGE -0.17 0.09 0.14 0.17 0.20 -0.08 0.02 1.00 -0.12 0.24 -0.05
NINQ 0.17 0.04 0.03 -0.00 -0.07 0.17 0.07 -0.12 1.00 0.09 0.14
CLNO -0.00 0.07 0.32 0.27 0.02 0.06 0.16 0.24 0.09 1.00 0.19
DEBTINC 0.20 0.08 0.15 0.13 -0.06 0.02 0.05 -0.05 0.14 0.19 1.00
In [47]:
mask = np.triu(corr_columns)
#now, we create our heatmap based on the correlations above, with the mask applied, annotating it, choosing a classic
#blue-red colour scheme, and with the max value being 1. Since we do not have any correlations below -0.1, we will not
#define a lower bound via vmin. Additionally, whilst we do include vmax = 1 here, it is more for sanity than
#a precise use as we know from the above table that none of the correlations go above 0.31.
sns.heatmap(corr_columns, mask=mask, annot = True, cmap = 'coolwarm', vmax = 1, fmt='.2f')
plt.show()
No description has been provided for this image

Observations

Next up, we will use plotly to create a few 3d graphs for various purposes as part of our multivariate analysis. Firstly, I want to have a look at the relationship between the following variables: A customer's job, how many years they have spent in that position, and how that stacks up to whether they have defaulted on their loan or not.

In [50]:
client_profile_3d = px.scatter_3d(df, x='YOJ', y='LOAN', z='BAD', color='JOB',
                                  labels = dict(YOJ='Years in current role', LOAN='Loan Value', 
                                               BAD='Loan Defaulted or Repaid'))
client_profile_3d.show()

Observations

  • It appears that Professional Executives are much less likely to default on their loans, than any other job. They also seem to predominantly spend 20 years or less in their current role, with some exceptions. Their loans awarded, interestingly, do not seem to be significantly higher on average than the other categories. They also do not seem to hold the highest outlier status when it comes to the value of the loans, interestingly. This could be caused by their already high income not necessitating exceptionally high loans for the purposes of their home, as they may decide to partially cover the costs out of pocket. Interstingly, however, several executives also held some of the higher loans that have been defaulted on.
  • As per our EDA, those in the 'Other' job category are by far the most prevalent in the dataset. They have some of the most expensive approved loans (as high as nearly 90k USD). They also are responsible for some of the highest loans that have been either repaid, or defaulted on. It is tough to draw many conclusions regarding this category, as it is all-encompassing, and may involve anybody outside of the distinct corporate roles categorised specifically in this dataset.
  • Self-Employed individuals appear to proportionally receive some of the highest loans in the dataset. The category holds some of the highest repaid loans of the dataset (only overshadowed by outliers in the 'Other' job category. However, they also hold the two spots for the highest defaulted loans (77.4k and 77.2k USD). They also seem to have a comparatively average tenure in their current role, a few outliers notwithstanding.
  • Similarly to the self-employed, Salespeople seem to have some of the shorter tenures in their roles on average. This could be explained by the fact that sales roles are often a stepping stone on the way to further roles in the corporate world: additionally, sales roles are taxing on individuals, with high rates of attrition often seen in the industry (as per my own experience). These roles are also apparently quite likely to default on their loans compared to others, which conforms with our bivariate EDA. This is possibly the case due to the fact that individual income in sales roles is performance driven - in a lot of cases, it is not as stable as that seen in other roles, and a bad year for the firm/year of poor or average performance can often lead to a significant drop in income.

As discussed earlier in our bivariate analysis, let's have a more detailed look at the home financials of our bank's clients, to get a better picture of where there may be some interesting correlations.

In [53]:
property_financials_3d = px.scatter_3d(df, x='LOAN', y='VALUE', z='MORTDUE', color='BAD',
                                  labels = dict(VALUE='Total Property Value', LOAN='Loan Value', 
                                               BAD='Loan Defaulted or Repaid', MORTDUE='Mortgage Due'))
property_financials_3d.show()

Observations

  • Expectedly, and as seen in our correlation heatmap, the amount of mortgage due on the client's property is strongly positively correlated to the total value of the property.
  • It appears that there is very little, if any, correlation between either the total value of a client's property or the amount due on their mortgage with the value of the loan awarded.
  • Interestingly, it appears that save for some outliers, there is a slight positive relationship between the value of the loan awarded and the loan being repaid. Interestingly, these two factors look more correlated than the value of the loan and whether it is defaulted on, which appears to be much more randomly distributed.

Next, in order to get a better understanding of the financial history of our clients, it would be interesting to see the relationship between their job role, their number of open credit lines, number of major derogatory reports and number of delinquent credit lines. This is to get a better understanding of the customers before they were awarded their loan.

In [56]:
financial_history_3d = px.scatter_3d(df, x='CLNO', y='DELINQ', z='DEROG', color='JOB',
                                  labels = dict(CLNO='Total No. Of Credit Lines', JOB='Current Job', 
                                               DELINQ='No. of Delinquent Credit Lines', DEROG='No. of Derogatory Reports'))
financial_history_3d.show()

Observations

  • Here we can begin to see an interesting trend: those occupying 'Other' roles are the most likely to have a high number of derogatory reports and to have the highest number of delinquent credit lines. We will delve deeper into this further in our next graph, as it would be interesting to see if there is a correlation between the job of the client, their number of derogatory reports and their property value.
  • The second highest likelihood to have a high number of derogatory reports and delinquent credit lines, interestingly enough, is seen in managerial workers. This is interesting, and runs contrarily to our bivariate analysis earlier on, as managerial workers have not been amidst the more likely occupation profiles to default on their loan.
  • Interestingly, we can see a particularly self-contrary ourlier in this graph: a self-employed individual with 45 open credit lines, of which 15 are delinquent, but with only 2 derogatory reports. As this is a self-employed individual, it is possible that some of their credit lines were opened under their business, or were consolidated under their business entity. This could call into question the way that this data was collected, as it shows that it is possible that some of the applicants lied about some of their data, or at the very least were able to misrepresent their true credit history via creative accounting.

As mentioned before, let's have a look at the relationship between our clients' occupation, the number of delinquent credit lines they hold, the number of derogatory reports they hold and their property value/mortgage due to get a better picture of a combination of their broad financial profile and history.

In [59]:
client_overall_profile = px.scatter(df, x='VALUE', y='DELINQ', color='JOB',
                                   labels={'VALUE': 'Total Property Value',
                                          'DELINQ': 'No. of Delinquent Credit Lines'})
client_overall_profile.show()
In [60]:
client_overall_profile2 = px.scatter(df, x='MORTDUE', y='DEROG', color='JOB',
                                   labels={'MORTDUE': 'Total Mortgage Due',
                                          'DEROG': 'No. of Derogatory Reports'})
client_overall_profile2.show()
In [61]:
managers_df = df[df['JOB'] == 'Mgr']
managers_df['BAD'].value_counts()
Out[61]:
BAD
0    588
1    179
Name: count, dtype: int64

Observations

  • It appears that broadly, those who own a lesser valued property are somewhat more likely to have a higher number of delinquent credit lines.
  • Professional executives own the highest range of property values in the dataset, with exception of two 'other' outliers. However, the 'other' outliers do not appear to have a significantly higher amount of mortgage due to pay than others. This indicates that those two outliers have already paid off a significant amount of their mortgage, and have done so on time given their lack of derogatory reports or delinquent credit lines.
  • Self-employed individuals, professional executives and those falling under the 'other' category seem to be represented highly in a lesser number of delinquent CLs and Derogatory Reports. Interestingly, those in managerial roles and office roles are much more likely to hold a higher number of delinqient credit lines and especially derogatory reports proportionally to other groups. This could be due to poor standards in loan analysis by former loaners, but could also be the result of non-sales performance-driven income roles, such as front office commercial roles in trading organisations or banks.
  • Those in the 'Other' category, as elsewhere throughout our data analysis, are spread out relatively normally throughout the various features of the set, as the category is all-encompassing and may include workers throughout a number of occupations.

Treating Missing Values¶

In [64]:
df.isnull().sum()
Out[64]:
BAD           0
LOAN          0
MORTDUE     518
VALUE       112
REASON      252
JOB         279
YOJ         515
DEROG       708
DELINQ      580
CLAGE       308
NINQ        510
CLNO        222
DEBTINC    1267
dtype: int64
In [65]:
null_rows = df[df.isnull().any(axis=1)]
null_rows
Out[65]:
BAD LOAN MORTDUE VALUE REASON JOB YOJ DEROG DELINQ CLAGE NINQ CLNO DEBTINC
0 1 1100 25,860.00 39,025.00 HomeImp Other 10.50 0.00 0.00 94.37 1.00 9.00 NaN
1 1 1300 70,053.00 68,400.00 HomeImp Other 7.00 0.00 2.00 121.83 0.00 14.00 NaN
2 1 1500 13,500.00 16,700.00 HomeImp Other 4.00 0.00 0.00 149.47 1.00 10.00 NaN
3 1 1500 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 0 1700 97,800.00 112,000.00 HomeImp Office 3.00 0.00 0.00 93.33 0.00 14.00 NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ...
5944 0 81400 22,344.00 108,280.00 HomeImp NaN 25.00 0.00 0.00 148.11 0.00 14.00 34.29
5945 0 81400 21,041.00 111,304.00 HomeImp NaN 26.00 0.00 0.00 146.32 0.00 14.00 32.13
5946 0 82200 16,097.00 111,287.00 HomeImp NaN 26.00 0.00 0.00 142.12 0.00 14.00 31.74
5947 0 82200 23,197.00 110,481.00 HomeImp NaN 26.00 0.00 1.00 127.77 0.00 14.00 30.94
5948 0 86000 47,355.00 85,000.00 DebtCon Other 15.00 0.00 0.00 210.97 0.00 16.00 NaN

2596 rows × 13 columns

Let's start with the categorical columns, as there are only two, for our missing value treatment

In [67]:
df['REASON'].value_counts()
Out[67]:
REASON
DebtCon    3928
HomeImp    1780
Name: count, dtype: int64
In [68]:
df['JOB'].value_counts()
Out[68]:
JOB
Other      2388
ProfExe    1276
Office      948
Mgr         767
Self        193
Sales       109
Name: count, dtype: int64
In [69]:
df['REASON'].isnull().sum()
Out[69]:
252
In [70]:
df['JOB'].isnull().sum()
Out[70]:
279
In [71]:
missing_cat = df[df['REASON'].isnull() & df['JOB'].isnull()]
missing_cat.shape
Out[71]:
(107, 13)

The total amount of rows missing one of the categorical columns is 424, with 107 of them missing both the reason for the loan and the job the applicant holds. However, a lot of these rows contain information that would be relevant to include in the model. Therefore, in order to handle the missing values here, we will perform a series of steps:

  1. Remove any row which is missing both of the pieces of information needed.
  2. For the remaining rows, impute the mode of the column
  • The reason why we are not dropping all the rows with at least 1 missing value is because doing so would incur a nearly 8% loss of information in the dataset.
In [73]:
missing_cat_index = df[df['REASON'].isnull() & df['JOB'].isnull()].index
missing_cat_index
Out[73]:
Index([   3,   10,   17,   51,   73,  112,  115,  143,  237,  268,
       ...
       4418, 4581, 4632, 4660, 4680, 4880, 4947, 5331, 5348, 5468],
      dtype='int64', length=107)
In [74]:
df = df.drop(missing_cat_index, axis = 0)
In [75]:
df[df['REASON'].isnull() & df['JOB'].isnull()]
Out[75]:
BAD LOAN MORTDUE VALUE REASON JOB YOJ DEROG DELINQ CLAGE NINQ CLNO DEBTINC
In [76]:
df['REASON'].fillna((df['REASON'].mode()[0]), inplace=True)
df['JOB'].fillna((df['JOB'].mode()[0]), inplace=True)
In [77]:
df.isnull().sum()
Out[77]:
BAD           0
LOAN          0
MORTDUE     445
VALUE       100
REASON        0
JOB           0
YOJ         432
DEROG       623
DELINQ      495
CLAGE       210
NINQ        425
CLNO        137
DEBTINC    1249
dtype: int64
In [78]:
df.shape
Out[78]:
(5853, 13)

Now, let's move onto cleaning our numeric features. As there is a significant amount of data missing, we cannot remove all of our missing values nor simply fill in missing values with the averages, as this could A. introduce a significant degree of bias into our dataset and B. significantly impact the effectiveness of our models we will build later on.

Therefore, in order to handle the missing numeric data, we will impute the missing values using KNN imputation. This will impute our missing values with an estimation for each feature using other, available data points within each row. Additionally, seeing as we will have multiple missing values across several features in some rows, we will be using the iterative imputer in order to go through them step by step and allow us to fill them all in accurately, preserving the integrity and consistency of our dataset. Further details on KNN Imputation can be found in the official Scikit-Learn documentation: https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html#sklearn.impute.IterativeImputer

In [80]:
df.describe().T
Out[80]:
count mean std min 25% 50% 75% max
BAD 5,853.00 0.20 0.40 0.00 0.00 0.00 0.00 1.00
LOAN 5,853.00 18,696.40 11,253.38 1,100.00 11,100.00 16,400.00 23,400.00 89,900.00
MORTDUE 5,408.00 73,914.26 44,504.99 2,063.00 46,404.50 65,106.50 91,740.25 399,550.00
VALUE 5,753.00 102,112.99 57,638.65 8,000.00 66,193.00 89,673.00 120,000.00 855,909.00
YOJ 5,421.00 8.95 7.57 0.00 3.00 7.00 13.00 41.00
DEROG 5,230.00 0.25 0.84 0.00 0.00 0.00 0.00 10.00
DELINQ 5,358.00 0.45 1.13 0.00 0.00 0.00 0.00 15.00
CLAGE 5,643.00 179.83 85.85 0.00 115.17 173.49 231.67 1,168.23
NINQ 5,428.00 1.19 1.72 0.00 0.00 1.00 2.00 17.00
CLNO 5,716.00 21.35 10.10 0.00 15.00 20.00 26.00 71.00
DEBTINC 4,604.00 34.00 8.38 0.52 29.31 34.93 39.08 203.31

Before we impute our data, let's check the statistical summary of our data as a sanity check. This is to compare the summary pre- and post-imputation, to ensure the overall structure of the dataset remains congruent after our treatment.

In [82]:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.neighbors import KNeighborsRegressor
In [83]:
#initialising our KNN regressor with 10 neighbours. This will serve as our estimator.
knn_regressor = KNeighborsRegressor(n_neighbors=10)
#creating our iterative imputer. In addition to our default parameters, we specify the 'ascending' 
#imputation order to ensure that the data with the least missing features is filled first, to give us
#more accurate imputation in rows with a significant number of missing features.
imputer = IterativeImputer(estimator=knn_regressor,  
                           imputation_order='ascending', 
                           random_state=0)
In [84]:
#defining our numeric and categorical columns, so we can have the imputer fill in only missing values
#in numeric columns, as we have already treated our categorical columns.
num_cols = df.select_dtypes(include=[np.number]).columns
cat_cols = df.select_dtypes(exclude=[np.number]).columns
In [85]:
df_num_imputed = pd.DataFrame(imputer.fit_transform(df[num_cols]), columns=num_cols)
In [86]:
df = pd.concat([df[cat_cols].reset_index(drop=True), df_num_imputed.reset_index(drop=True)], axis=1)
In [87]:
df
Out[87]:
REASON JOB BAD LOAN MORTDUE VALUE YOJ DEROG DELINQ CLAGE NINQ CLNO DEBTINC
0 HomeImp Other 1.00 1,100.00 25,860.00 39,025.00 10.50 0.00 0.00 94.37 1.00 9.00 32.84
1 HomeImp Other 1.00 1,300.00 70,053.00 68,400.00 7.00 0.00 2.00 121.83 0.00 14.00 40.60
2 HomeImp Other 1.00 1,500.00 13,500.00 16,700.00 4.00 0.00 0.00 149.47 1.00 10.00 31.59
3 HomeImp Office 0.00 1,700.00 97,800.00 112,000.00 3.00 0.00 0.00 93.33 0.00 14.00 34.69
4 HomeImp Other 1.00 1,700.00 30,548.00 40,320.00 9.00 0.00 0.00 101.47 1.00 8.00 37.11
... ... ... ... ... ... ... ... ... ... ... ... ... ...
5848 DebtCon Other 0.00 88,900.00 57,264.00 90,185.00 16.00 0.00 0.00 221.81 0.00 16.00 36.11
5849 DebtCon Other 0.00 89,000.00 54,576.00 92,937.00 16.00 0.00 0.00 208.69 0.00 15.00 35.86
5850 DebtCon Other 0.00 89,200.00 54,045.00 92,924.00 15.00 0.00 0.00 212.28 0.00 15.00 35.56
5851 DebtCon Other 0.00 89,800.00 50,370.00 91,861.00 14.00 0.00 0.00 213.89 0.00 16.00 34.34
5852 DebtCon Other 0.00 89,900.00 48,811.00 88,934.00 15.00 0.00 0.00 219.60 0.00 16.00 34.57

5853 rows × 13 columns

Treating Outliers¶

As we can see in our above data, our dataset is heavily right-skewed for all non-target numeric variables, with several extreme outliers affecting the distribution of our datasets.

There are several outlier treatment techniques we could apply in order to treat them.

Firstly, if we believed there was reason to this, we could simply remove the outlier data from our dataset. However, given the outlier data seemingly represents realistic data points, they are unlikely to be a result of data collection errors.

We will instead take a log transformation of the data in order to better identify out extreme outliers, and manually cap them to our upper bound, as for the purposes of our models (binary classification), this will still make them serve as valid data points, without introducing a significant level of bias for our linear models. We will then exponentiate the data back to its original scale in order to allow us to better interpret and explain our models.

Additionally, there are some outliers which will not be treated: their treatment is unlikely to have an effect on the results of our models as their typical value in the dataset is 0, but they are likely to serve as reliable indicators that the models may pick up upon. These features are the number of delinquent credit lines someone has, and the number of derogatory reports someone holds.

In [89]:
#creating a function for our log transformation of the data.
def log_transformation_treatment(data, feature):
    #taking the log of our feature (and adding 1 to handle 0 values)
    data[feature] = np.log(data[feature] + 1)
    
    #defining our quantiles and interquartile range
    q1 = np.percentile(data[feature], 25)
    q3 = np.percentile(data[feature], 75)
    iqr = q3 - q1
    #defining our upper and lower bounds
    lower = q1 - (iqr * 1.5)
    upper = q3 + (iqr * 1.5)
    
    #capping our features at both the upper or lower quantile
    data[feature] = np.clip(data[feature], lower, upper)
    
    #taking the exponent of our data in order to return it to the original scale
    data[feature] = np.exp(data[feature]) - 1
    
    return data
In [90]:
#Removing our target variable, as well as the features where both Q1 and Q3 are 0 as seen in EDA
num_cols_list = num_cols.tolist()
num_cols_list.remove('BAD')
num_cols_list.remove('DEROG')
num_cols_list.remove('DELINQ')
num_cols = pd.Index(num_cols_list)
In [91]:
#applying our log transformation to lessen the impact of outliers in our models.
for feature in df[num_cols]:
    log_transformation_treatment(df, feature)

df.describe().T
Out[91]:
count mean std min 25% 50% 75% max
BAD 5,853.00 0.20 0.40 0.00 0.00 0.00 0.00 1.00
LOAN 5,853.00 18,649.28 10,930.06 3,626.05 11,100.00 16,400.00 23,400.00 71,620.41
MORTDUE 5,853.00 71,142.93 42,857.13 14,373.55 43,000.00 62,973.00 89,275.00 267,065.23
VALUE 5,853.00 101,230.61 51,798.89 26,977.19 66,004.70 89,634.00 119,846.00 293,220.46
YOJ 5,853.00 9.02 7.42 0.00 3.00 7.00 13.00 41.00
DEROG 5,853.00 0.25 0.80 0.00 0.00 0.00 0.00 10.00
DELINQ 5,853.00 0.45 1.08 0.00 0.00 0.00 0.30 15.00
CLAGE 5,853.00 179.62 82.56 41.73 116.47 172.91 229.51 632.67
NINQ 5,853.00 1.19 1.67 0.00 0.00 1.00 2.00 14.59
CLNO 5,853.00 21.45 9.77 6.30 15.00 20.00 26.00 58.19
DEBTINC 5,853.00 33.86 6.32 20.11 29.74 34.46 38.50 56.53

Important Insights from EDA¶

What are the the most important observations and insights from the data based on the EDA performed?

  • The majority of the dataset is heavily right-skewed in regards to its numeric features. This means that it is unlikely that the primary focus of the bank's loans are aimed at extremely high income brackets (1M+ per year), except for a few select customers who own properties worth nearly 1M USD.
  • It appears that the 'Other' job category is all-encompassing, and includes a number of individuals from widely different income ranges, work roles and credit histories. Therefore it, in itself, is unlikely to serve as a significant predictor in our dataset. A significant improvement to the datasets could potentially be made by allowing individuals to input their own employment, as opposed to choosing one of several pre-determined statuses as appears to be the case here. Then, this data could be aggregated and serve as an interesting case study in itself, allowing the bank to have a deeper understanding of employees in different roles and their proneness to defaulting on loans.
  • 20% of loans given out by the bank in this dataset were defaulted on: this means that the bank effectively mistrusted 1 in 5 applicants, which has resulted in severe financial damage to the bank within the scope of this study alone.
  • Those working in Sales are the most likely to default on their loans. Those in the 'Other' and 'ProfExe' roles are the least likely to default.

Model Building - Approach¶

  • Data preparation
  • Partition the data into train and test set
  • Build the model
  • Fit on the train data
  • Tune the model
  • Test the model on test set

Data Preprocessing¶

Firstly, in order to make sure that our model is able to give us sound results, we will split our dataset into two different variants after pre-processing. We will need to prepare our data via the following steps:

  1. One-hot encoding our categoric variables to ensure that the data we are working with is numeric across the board, making it usable for our models.
  2. Dropping our highly correlated variables: as we saw in the EDA, the value of the client's property and the mortgage due are highly correlated. Therefore, we will drop the property value from our dataset, as the amount of mortgage due is an arguably more important detail when assessing the likelihood of a client defaulting on their credit. This also has the potential to aid the bank with any further administrative procedures, should they reject an applicant: a practice of denying applicants loans based on property value could be deemed a breach of law under the Equal Credit and Opportunity act (As per Disparate Impact), depending on the further proceedings should this be a visible creterion. However, mortgage due is much less likely to be disputed as a factor when assessing the likelihood of a loan default, and it still effectively provides the model with ~88% of total information about an applicant's house value.
  3. Splitting our data using train_test_split. As we have a medium-sized dataset, we will use a 0.8 train and 0.2 test data ratio, to ensure there are sufficient data points for our models to learn from.
  4. For building our logistic regression model, we will need to apply a standard scaler to ensure all of our features are presented to the model at a standard scale, as the units used by different features vary widely and would otherwise significantly affect our model's results. Even though we have already applied a log transformation to our data, this was a method to treat outliers, and is in itself not sufficient to normalise our dataset for linear models.
  5. For our classification tree-based models, we will not be applying a standard scaler, as it is a non-linear model and therefore does not need the units to be in alignment with one another in regards to mean and variance.
In [96]:
#making our categorical columns index into a list to make the next step easy
cat_cols = cat_cols.tolist()
In [97]:
#one-hot encoding our categoric variables
df = pd.get_dummies(df, columns = cat_cols, drop_first=True, dtype='int')
df
Out[97]:
BAD LOAN MORTDUE VALUE YOJ DEROG DELINQ CLAGE NINQ CLNO DEBTINC REASON_HomeImp JOB_Office JOB_Other JOB_ProfExe JOB_Sales JOB_Self
0 1.00 3,626.05 25,860.00 39,025.00 10.50 0.00 0.00 94.37 1.00 9.00 32.84 1 0 1 0 0 0
1 1.00 3,626.05 70,053.00 68,400.00 7.00 0.00 2.00 121.83 0.00 14.00 40.60 1 0 1 0 0 0
2 1.00 3,626.05 14,373.55 26,977.19 4.00 0.00 0.00 149.47 1.00 10.00 31.59 1 0 1 0 0 0
3 0.00 3,626.05 97,800.00 112,000.00 3.00 0.00 0.00 93.33 0.00 14.00 34.69 1 1 0 0 0 0
4 1.00 3,626.05 30,548.00 40,320.00 9.00 0.00 0.00 101.47 1.00 8.00 37.11 1 0 1 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
5848 0.00 71,620.41 57,264.00 90,185.00 16.00 0.00 0.00 221.81 0.00 16.00 36.11 0 0 1 0 0 0
5849 0.00 71,620.41 54,576.00 92,937.00 16.00 0.00 0.00 208.69 0.00 15.00 35.86 0 0 1 0 0 0
5850 0.00 71,620.41 54,045.00 92,924.00 15.00 0.00 0.00 212.28 0.00 15.00 35.56 0 0 1 0 0 0
5851 0.00 71,620.41 50,370.00 91,861.00 14.00 0.00 0.00 213.89 0.00 16.00 34.34 0 0 1 0 0 0
5852 0.00 71,620.41 48,811.00 88,934.00 15.00 0.00 0.00 219.60 0.00 16.00 34.57 0 0 1 0 0 0

5853 rows × 17 columns

In [98]:
df.drop('VALUE', axis=1, inplace=True)
df.head()
Out[98]:
BAD LOAN MORTDUE YOJ DEROG DELINQ CLAGE NINQ CLNO DEBTINC REASON_HomeImp JOB_Office JOB_Other JOB_ProfExe JOB_Sales JOB_Self
0 1.00 3,626.05 25,860.00 10.50 0.00 0.00 94.37 1.00 9.00 32.84 1 0 1 0 0 0
1 1.00 3,626.05 70,053.00 7.00 0.00 2.00 121.83 0.00 14.00 40.60 1 0 1 0 0 0
2 1.00 3,626.05 14,373.55 4.00 0.00 0.00 149.47 1.00 10.00 31.59 1 0 1 0 0 0
3 0.00 3,626.05 97,800.00 3.00 0.00 0.00 93.33 0.00 14.00 34.69 1 1 0 0 0 0
4 1.00 3,626.05 30,548.00 9.00 0.00 0.00 101.47 1.00 8.00 37.11 1 0 1 0 0 0
In [99]:
#Separating our target and non-target variables
x = df.drop('BAD', axis=1)

y = df['BAD']
In [100]:
#Separating our data into train and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, shuffle=True, random_state = 0)
In [101]:
# Checking the shape of the train and test data
print("Shape of Training set : ", x_train.shape, y_train.shape)
print("Shape of test set : ", x_test.shape, y_test.shape)
Shape of Training set :  (4682, 15) (4682,)
Shape of test set :  (1171, 15) (1171,)

Before we move onto steps 3 and 4, let's take a moment to define the metrics for our models.

For the purposes of this study, we will define our null hypothesis as: The target client will not default on the loan (BAD = 0). The alternate hypothesis is hence: BAD = 1.

In this case, we will be focusing on optimising recall, as each individual case of a false positive is extremely damaging to the bank's finances via a loan default. Otherwise, false negatives are also a problem as the bank will miss out on a customer from whom they would be able to earn money, but each individual case of a false negative is not, in itself, as damaging to the bank. This philosophy thus drives our approach to determining the threshold in our logistic regression model, and ultimately the model that we will recommend for commercial use.

Thus, let us define the following two things before we continue to scaling our data:

  1. The Confusion Matrix
  2. The Metrics chart

We will additionally later define a further function to plot our Precision/Recall curve, to see where our treshold should be. We will do this after we have trained our logistic regression model.

In [103]:
#Adapted from: HR Employee Attrition Prediction notebook by Great Learning.

# Creating metric function

def metrics_score(actual, predicted):

    print(classification_report(actual, predicted))

    cm = confusion_matrix(actual, predicted)

    plt.figure(figsize = (8, 5))

    sns.heatmap(cm, annot = True, fmt = '.2f', xticklabels = ['Not Defaulted', 'Defaulted'], yticklabels = ['Not Defaulted', 'Defaulted'])
    plt.ylabel('Actual')

    plt.xlabel('Predicted')

    plt.show()
    

# Creating the Classification Performance Metrics table

def model_performance_classification(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier

    predictors: independent variables

    target: dependent variable
    """

    # Predicting using the independent variables, as well as setting our positive label
    
    pos_label = 1
    
    pred = model.predict(predictors)

    recall = recall_score(target, pred, pos_label=pos_label, average = 'macro')                 # To compute recall

    precision = precision_score(target, pred, pos_label=pos_label, average = 'macro')              # To compute precision

    acc = accuracy_score(target, pred)                                 # To compute accuracy score
    
    recall_class_1 = recall_score(target, pred, pos_label=1, average = 'binary')            #To compute recall score for our defaulting clients


    # Creating a dataframe of metrics

    df_perf = pd.DataFrame(
        {
            "Precision":  precision,
            "Recall":  recall,
            "Recall (Defaulting Loans)": recall_class_1,
            "Accuracy": acc,
        },

        index = [0],
    )

    return df_perf

# Creating the Linear Regression for Binary Classification Performance Metrics table

def model_performance_regression(model, predictors, target, threshold):
    """
    Function to compute different metrics to check linear regression for binary classification model performance

    model: lineear regression

    predictors: independent variables

    target: dependent variable
    
    threshold: decision threshold for binary classification
    """
    
    # Setting our positive label
    
    pos_label = 1
    
    # Predicting Probabilities of the target variable
    pred_prob = model.predict_proba(predictors)[:, 1]
    
    # Setting our threshold
    
    pred = (pred_prob >= threshold).astype(int)
    
    

    recall = recall_score(target, pred, pos_label=pos_label, average = 'macro')                 # To compute recall

    precision = precision_score(target, pred, pos_label=pos_label, average = 'macro')              # To compute precision

    acc = accuracy_score(target, pred)                                 # To compute accuracy score
    
    recall_class_1 = recall_score(target, pred, pos_label=1, average = 'binary')            #To compute recall score for our defaulting clients


    # Creating a dataframe of metrics

    df_perf = pd.DataFrame(
        {
            "Precision":  precision,
            "Recall":  recall,
            "Recall (Defaulting Loans)": recall_class_1,
            "Accuracy": acc,
        },

        index = [0],
    )

    return df_perf

References:¶

Sultan, A. HR Employee Attrition Prediction, MIT - Great Learning. Available at: https://olympus.mygreatlearning.com/courses/102279/files/10506888?module_item_id=5884087 (Accessed: 04 Aug 2024).

Logistic Regression¶

Now, let's scale our data for our logistic regression model, and construct our model.

In [107]:
from sklearn.preprocessing import StandardScaler
In [108]:
#initialising our scaler
scaler = StandardScaler()
In [109]:
#scaling the data
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)
In [110]:
#defining and fitting our model on our training data
logistic_reg = LogisticRegression()
logistic_reg.fit(x_train_scaled, y_train)
Out[110]:
LogisticRegression()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
LogisticRegression()
In [111]:
#predicting the probability of our hypothesis, and selecting the probability of BAD=1
#as our target
y_pred_prob = logistic_reg.predict_proba(x_test_scaled)[:, 1]
In [112]:
y_pred_prob
Out[112]:
array([0.12298145, 0.16711751, 0.19980537, ..., 0.16059954, 0.10812457,
       0.04791157])

Now, let's plot our precision-recall curve in order to estimate where our threshold should lie.

In [114]:
from sklearn.metrics import precision_recall_curve, auc
In [115]:
precision, recall, threshold = precision_recall_curve(y_test, y_pred_prob)
In [116]:
#Adapted from: 'Logistic Regression for Binary Classification Task' by Fares Sayah, Kaggle
def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1], "b--", label="Precision")
    plt.plot(thresholds, recalls[:-1], "g--", label="Recall")
    plt.xlabel("Threshold")
    plt.legend(loc="upper left")
    plt.title("Precisions/recalls tradeoff")

plt.figure(figsize=(15, 10))
plot_precision_recall_vs_threshold(precision, recall, threshold)
No description has been provided for this image

References¶

Sayah, F. Logistic Regression for Binary Classification Task, Kaggle. Available at: https://www.kaggle.com/code/faressayah/logistic-regression-for-binary-classification-task (Accessed: 05 August 2024).

As we can see, in this case, the logistic regression model is not very well fit for our needs. Its Precision-Recall curve shows an optimal threshold probability to be around 0.25, which would result in a recall and accuracy score of ~0.55. This means that this model would only be able to accurately predict whether around 55% of clients would default on their loan, if in reality they will default.

Let's take a more in-depth look at the results. We can do this via our metrics and confusion matrix. However, before we do that, we need to first convert our probabilistic results into a binary form so that we may compare the true and predicted results.

In [119]:
y_pred_binary = (y_pred_prob >= 0.25).astype(int)
In [120]:
metrics_score(y_test, y_pred_binary)

lr_metrics = model_performance_regression(logistic_reg, x_test_scaled, y_test, 0.25)
lr_metrics
              precision    recall  f1-score   support

         0.0       0.87      0.86      0.87       910
         1.0       0.54      0.56      0.55       261

    accuracy                           0.80      1171
   macro avg       0.71      0.71      0.71      1171
weighted avg       0.80      0.80      0.80      1171

No description has been provided for this image
Out[120]:
Precision Recall Recall (Defaulting Loans) Accuracy
0 0.71 0.71 0.56 0.80

As we can see, the model only has a recall score of 0.56 on those clients who default on the loan, which is worse performance than what is observed in the default practices performed by the bank. Let's see if pushing back the treshold to 0.11 will have a significant impact on our recall score, as well as the overall performance of the model.

In [122]:
y_pred_binary1 = (y_pred_prob >= 0.11).astype(int)
In [123]:
metrics_score(y_test, y_pred_binary1)
              precision    recall  f1-score   support

         0.0       0.93      0.46      0.62       910
         1.0       0.32      0.88      0.47       261

    accuracy                           0.56      1171
   macro avg       0.63      0.67      0.55      1171
weighted avg       0.80      0.56      0.59      1171

No description has been provided for this image
In [124]:
lr_perf = model_performance_classification(logistic_reg, x_test_scaled, y_test)
lr_perf
Out[124]:
Precision Recall Recall (Defaulting Loans) Accuracy
0 0.77 0.62 0.28 0.81

As we can see, by lowering our treshold for the probability of someone defaulting on their loan to be classified as likely to default, we have significantly improved our recall score. However, what the model has essentially done is classified such a large chunk of clients as being likely to default that it has significantly lost its ability to predict that somebody is likely not to default. We can see above that it has taken an accuracy-by-volume approach, and classified more than half of the dataset as likely to default.

Whilst this is a good safeguard against defaulting customers, it also effectively deprives the bank of a significant portion of clients that would be able to pay off their loans. This model is thus not recommended. This model is too generalistic, and is likely to not only cause financial damage to the bank, but also likely to A: cause a significant chunk of additional administrative work, and B: could cause the bank legal problems as per their justification in turning away any given potential client.

This model will not be effective for the purposes of our study, as it, in addition to the above, does not match nor supercede the results achieved by existing practices.

We can try to further tune the hyperparameters of our model in order to maximise its recall. We will be using GridSearchCV for this, and will look at the effect of changing the following parameters:

  • Penalty: Whether we are using L1 or L2 regularisation. We can safely make the prediction that L2 regularisation will be more effective here, as we have plenty of data points and a limited number of features. However, for the purposes of experimentation, we will see try to use L1 regardless. We will also see if using both L1 and L2 penalty terms, A.K.A. ElasticNet, and no error terms will affect the performance for the better.
  • Class Weights: automatic adjustment for the imbalance in the dataset. We saw that the dataset is split roughly 80%-20% between those who paid off their loan and those who defaulted, respectively. We can, however, try and see how different class weights will affect the performance of the model.
  • C: the strength of our regularisation.
  • Solver: we will have a look at LibLinear, Saga and Newton-Cg algorithms, as we have one single target feature and this is a binary classification problem, as well as because we are trying various penalty parameters.
In [127]:
lr_tuned = LogisticRegression()

param_grid = {
    'penalty': ['l1', 'l2', 'elasticnet', 'none'],
    'class_weight': [{1: 0.5, 0: 0.5}, {1: 0.35, 0: 0.65}, {1: 0.2, 0: 0.8}],
    'C': [0.5, 0.6, 0.7, 0.8],
    'solver': ['liblinear', 'newton-cg', 'saga']
}

scorer = make_scorer(recall_score, pos_label=1)

lr_tuned_cv = GridSearchCV(
    estimator = lr_tuned,
    param_grid = param_grid,
    cv = 10,
    scoring = scorer)

lr_tuned_cv = lr_tuned_cv.fit(x_train_scaled, y_train)

lr_tuned_estimator = lr_tuned_cv.best_estimator_

lr_tuned_estimator.fit(x_train_scaled, y_train)
Out[127]:
LogisticRegression(C=0.5, class_weight={0: 0.5, 1: 0.5}, solver='liblinear')
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
LogisticRegression(C=0.5, class_weight={0: 0.5, 1: 0.5}, solver='liblinear')
In [128]:
y_pred_prob_tuned = lr_tuned_estimator.predict_proba(x_test_scaled)[:, 1]
In [129]:
precision, recall, threshold = precision_recall_curve(y_test, y_pred_prob_tuned)

plt.figure(figsize=(15, 10))
plot_precision_recall_vs_threshold(precision, recall, threshold)
No description has been provided for this image
In [130]:
y_pred_binary_tuned = (y_pred_prob >= 0.25).astype(int)
metrics_score(y_test, y_pred_binary_tuned)
lr_tuned_metrics = model_performance_regression(lr_tuned_estimator, x_test_scaled, y_test, 0.25)
lr_tuned_metrics
              precision    recall  f1-score   support

         0.0       0.87      0.86      0.87       910
         1.0       0.54      0.56      0.55       261

    accuracy                           0.80      1171
   macro avg       0.71      0.71      0.71      1171
weighted avg       0.80      0.80      0.80      1171

No description has been provided for this image
Out[130]:
Precision Recall Recall (Defaulting Loans) Accuracy
0 0.70 0.71 0.56 0.79

It appears that even when tuned, our model does not give us a level of performance that surpasses our existing approach. Therefore, let us try out various decision tree-based models, in order to see if one of them will give us a more satisfactory level of recall and overall accuracy.

We can additionally try to build a Linear and Quadratic Discriminant Analysis models, in order to see if either would perform better with our dataset.

In [133]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
In [134]:
lda = LinearDiscriminantAnalysis()
lda.fit(x_train_scaled, y_train)
Out[134]:
LinearDiscriminantAnalysis()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
LinearDiscriminantAnalysis()
In [135]:
y_pred_train_lda = lda.predict(x_train_scaled)

metrics_score(y_train, y_pred_train_lda)
              precision    recall  f1-score   support

         0.0       0.85      0.96      0.90      3764
         1.0       0.66      0.29      0.40       918

    accuracy                           0.83      4682
   macro avg       0.75      0.62      0.65      4682
weighted avg       0.81      0.83      0.80      4682

No description has been provided for this image
In [136]:
y_pred_test_lda = lda.predict_proba(x_test_scaled)[:, 1]
In [137]:
precision, recall, threshold = precision_recall_curve(y_test, y_pred_test_lda)

plt.figure(figsize=(15, 10))
plot_precision_recall_vs_threshold(precision, recall, threshold)
No description has been provided for this image
In [138]:
y_pred_binary_lda = (y_pred_test_lda >= 0.22).astype(int)

metrics_score(y_test, y_pred_binary_lda)

lda_metrics = model_performance_regression(lda, x_test_scaled, y_test, 0.22)
lda_metrics
              precision    recall  f1-score   support

         0.0       0.87      0.87      0.87       910
         1.0       0.55      0.54      0.54       261

    accuracy                           0.80      1171
   macro avg       0.71      0.71      0.71      1171
weighted avg       0.80      0.80      0.80      1171

No description has been provided for this image
Out[138]:
Precision Recall Recall (Defaulting Loans) Accuracy
0 0.71 0.71 0.54 0.80

Observations

  • The model performs extremely well on recall for our 0 class, or clients who did not default on their loans. however, its recall for those who defaulted is 0.3, which, as we are aiming for a high recall value, is very poor. This model would incur 3.5 times as much financial damage via defaulted on loans than current practices, and therefore should not be used. It is unlikely that even via optimisation, this model would achieve a better performance than our current practices, so let's move on and see our if our QDA model performs better with default parameters.
In [140]:
qda = QuadraticDiscriminantAnalysis()

qda.fit(x_train_scaled, y_train)

y_pred_train_qda = qda.predict(x_train_scaled)

metrics_score(y_train, y_pred_train_qda)
              precision    recall  f1-score   support

         0.0       0.86      0.89      0.88      3764
         1.0       0.48      0.40      0.44       918

    accuracy                           0.80      4682
   macro avg       0.67      0.65      0.66      4682
weighted avg       0.79      0.80      0.79      4682

No description has been provided for this image
In [141]:
y_pred_test_qda = qda.predict_proba(x_test_scaled)[:, 1]
In [142]:
precision, recall, threshold = precision_recall_curve(y_test, y_pred_test_qda)

plt.figure(figsize=(15, 10))
plot_precision_recall_vs_threshold(precision, recall, threshold)
No description has been provided for this image
In [143]:
y_pred_binary_qda = (y_pred_test_qda >= 0.16).astype(int)

metrics_score(y_test, y_pred_binary_qda)
qda_metrics = model_performance_regression(qda, x_test_scaled, y_test, 0.16)
qda_metrics
              precision    recall  f1-score   support

         0.0       0.86      0.89      0.87       910
         1.0       0.56      0.48      0.52       261

    accuracy                           0.80      1171
   macro avg       0.71      0.69      0.70      1171
weighted avg       0.79      0.80      0.79      1171

No description has been provided for this image
Out[143]:
Precision Recall Recall (Defaulting Loans) Accuracy
0 0.71 0.69 0.48 0.80

Observations

  • Similarly to the last model. the QDA model performs extremely well on prediction of the 0 class, but falls flat at prediction of class 1 recall, at only 0.42.
  • The models both show a high number of false negatives, i.e. predicting that a customer will not default when, in reality, they do.
  • Let us move onto our tree-based models, and see how they stack up before ultimately making a decision on which model would be best suited for the purposes of our study.

Decision Tree¶

Note on building the models¶

During the course of this study, we will be building and comparing 5 different models, each with default and tuned hyperparameters.

Tree-based models:

  • Decision Tree Classifier
  • Random Forest Classifier

Gradient boost models:

  • Gradient Boost Classifier
  • AdaBoost Classifier
  • XGBoost Classifier

Note: These models vary in their performance as well as in their requirements in regards to computational complexity, and I will be commenting on this throughout the process. As these are decision tree models, they require only the CPU for tree construction and prediction. However, I will additionally be accelerating the process for my XGBoost model using my CUDA-supported GPU, which is a built-in feature of XGBoost by default. For the purposes of good documentation, I am performing the process of building the models on the following processor and graphics card (only for purposes of acceleration):


  • 11th Gen Intel(R) Core(TM) i7-11700F @ 2.50GHz, 2501 Mhz, 8 Core(s), 16 Logical Processor(s)
  • NVIDIA GeForce RTX 3070.

Potential time/computational power constraints should be taken into account when replicating the models, as construction and prediction time may vary depending on availability/model of GPU and the model of CPU used.

Furter details about CUDA-Accelerated Tree Construction Algorithms can be found in the official XGBoost Documentation: https://xgboost.readthedocs.io/en/release_1.3.0/gpu/index.html

Now, let's build our Decision Tree Classifier

  • As we observed in our EDA, defaulted loans appear at a frequency of 20% in the dataset, whereas loans that have been paid off represent roughly 80% of the data. We will be using the class_weight hyperparameter throughout building every model in order to ensure that this is accounted for. This can be done manually, but the 'balanced' method will perform all needed steps automatically for us (both applying inverse proportional weights and scaling them to sum up to the number of classes).
In [147]:
#Initialising our Decision Tree Classifier
dt = DecisionTreeClassifier(class_weight='balanced', random_state = 0)
#fitting the training data
dt.fit(x_train, y_train)
Out[147]:
DecisionTreeClassifier(class_weight='balanced', random_state=0)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
DecisionTreeClassifier(class_weight='balanced', random_state=0)
In [148]:
#let's check the performance of our tree on the training data.
train_dt_pred = dt.predict(x_train)

metrics_score(y_train, train_dt_pred)
model_performance_classification(dt, x_train, y_train)
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00      3764
         1.0       1.00      1.00      1.00       918

    accuracy                           1.00      4682
   macro avg       1.00      1.00      1.00      4682
weighted avg       1.00      1.00      1.00      4682

No description has been provided for this image
Out[148]:
Precision Recall Recall (Defaulting Loans) Accuracy
0 1.00 1.00 1.00 1.00

Our model has predicted our classes perfectly based on the training data.

Let's check the performance of the model on the test data.

In [150]:
test_dt_pred = dt.predict(x_test)

metrics_score(y_test, test_dt_pred)
dt_metrics = model_performance_classification(dt, x_test, y_test)
dt_metrics
              precision    recall  f1-score   support

         0.0       0.89      0.94      0.91       910
         1.0       0.74      0.60      0.66       261

    accuracy                           0.86      1171
   macro avg       0.81      0.77      0.79      1171
weighted avg       0.86      0.86      0.86      1171

No description has been provided for this image
Out[150]:
Precision Recall Recall (Defaulting Loans) Accuracy
0 0.81 0.77 0.60 0.86

Observations

  • On our training data, our model has an overall 77% recall, with a 60% recall on our defaulted loans. This means that the model effectively misclassifies 2 in 5 people who would default on their loans as 'safe' clients.
  • This does not match the performance of existing practices, and would effectively double the losses of the bank if implemented.

Let's see the feature importances of this model, to understand more about how our tree was constructed.

In [152]:
#Adapted from: HR Employee Attrition Prediction notebook by Great Learning.

#creating a function to plot the importances of features in our decision tree

def feature_importance_plot(model):
    #assigning our importances to a variable
    importances = model.feature_importances_

    columns = x.columns
    #buildinf a df of feature importances, sotred from most to least important features
    importance_df = pd.DataFrame(importances, index = columns, columns = ['Importance']).sort_values(by = 'Importance', ascending = False)

    plt.figure(figsize = (13, 13))
    #plotting our df
    sns.barplot(x=importance_df.Importance,y=importance_df.index)

    #displaying our feature importance ratio on the right of each feature's bar for ease of comprehension
    for index, value in enumerate(importance_df['Importance']):
        plt.text(value, index, f'{value:.2f}', va='center')

    plt.title("Feature Importances in Decision Tree Model");
In [153]:
feature_importance_plot(dt)
No description has been provided for this image

Observations

  • The age of a client's oldest credit line, interestingly, seems to be a key indicator for the tree, alongside the amount of mortgage they have due.
  • The 3rd to 5th most important indicators for the decision tree seem to be the number of delinquent credit lines a client holds, their debt-to-income ratio, and, interestingly, the total value of the loan they have applied for.
  • The reason for the loan as well as the job the client holds seem to be much less relevant indicators for the tree than what we reflected in the EDA.

Let's build another decision tree, this time with limited depth, so that we can visualise its decision-making process better.

In [155]:
from sklearn import tree
features = list(x.columns)

dt_classifier_visualised = DecisionTreeClassifier(class_weight = 'balanced', max_depth = 3, random_state = 0)

dt_classifier_visualised.fit(x_train, y_train)

plt.figure(figsize = (20, 20))
tree.plot_tree(dt_classifier_visualised, feature_names = features, filled = True, fontsize = 11, class_names = ['Not Defaulted', 'Defaulted'])
plt.show()
No description has been provided for this image

Observations

  • Interestingly, when limited to a maximum depth of 3 and visualised, our decision tree has chosen the number of delinquent credit lines as its primary indicator, as seen in the root node. Those with 0 delinquent credit lines are predicted as not having defaulted, which makes sense. Those with 1 or more delinquent credit lines, on the other hand, are predicted as having defaulted.
  • Out of those with 1+ delinq. CLs, the next major indicator chosen is the No. of derogatory reports in their credit history. The split here classifies both nodes in the next level as having defaulted, then using the age of the oldest credit line as the next predictor that separates the defaulted from the non-defaulted loan clients. Interestingly, those with credit lines older than 345.74 are predicted not to default.
  • Out of those with 0 delinquent reports, the next node splits based on the age of the credit line, into two nodes which are both classified as 'Not Defaulted'. Thereafter, the only 'defaulted' leaf can be seen after a further split, in those whose debt-to-income ratio is >43.68.

Decision Tree - Hyperparameter Tuning¶

  • Hyperparameter tuning is tricky in the sense that there is no direct way to calculate how a change in the hyperparameter value will reduce the loss of your model, so we usually resort to experimentation. We'll use Grid search to perform hyperparameter tuning.
  • Grid search is a tuning technique that attempts to compute the optimum values of hyperparameters.
  • It is an exhaustive search that is performed on the specific parameter values of a model.
  • The parameters of the estimator/model used to apply these methods are optimized by cross-validated grid-search over a parameter grid.

Criterion {“gini”, “entropy”}

The function to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain.

max_depth

The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.

min_samples_leaf

The minimum number of samples is required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.

You can learn about more Hyperpapameters on this link and try to tune them.

https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

Having built our initial decision tree, let's use GridSearchCV in order to optimise our single tree, and see if we can achieve a higher level of recall than what has been seen in the unoptimised tree.

We will use the parameters listed above in order to tune our tree.

In [159]:
#building our decision tree for the purposes of optimising its hyperparameters.
dtree_estimator = DecisionTreeClassifier(class_weight='balanced', random_state = 1)

#assigning a grid of parameters to choose from.
parameters = {
    'max_depth': np.arange(2, 8),
    'criterion': ['gini', 'entropy'],
    'min_samples_leaf': [5, 10, 15, 20, 25]
}
#designing our scoring system for GridSearchCV
scorer = make_scorer(recall_score, pos_label=1)

#running the grid search. The reason we use 10-fold cross validation here is due to the dataset being relatively small.
#the reason we do not use stratified k-fold despite the class imbalance is because we account for the imbalance in
#our estimator.
gridCV = GridSearchCV(dtree_estimator, parameters, scoring = scorer, cv=10)

#fitting the grid search on the training dataset
gridCV = gridCV.fit(x_train, y_train)
#setting the classifier to look for the best combination of parameters
dtree_estimator = gridCV.best_estimator_

#fitting the estimator to the data
dtree_estimator.fit(x_train, y_train)
Out[159]:
DecisionTreeClassifier(class_weight='balanced', max_depth=7,
                       min_samples_leaf=10, random_state=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
DecisionTreeClassifier(class_weight='balanced', max_depth=7,
                       min_samples_leaf=10, random_state=1)
In [160]:
#checking the performance of our optimised decision tree on the training data
y_train_dt_tuned_pred = dtree_estimator.predict(x_train)

metrics_score(y_train, y_train_dt_tuned_pred)
              precision    recall  f1-score   support

         0.0       0.94      0.78      0.85      3764
         1.0       0.47      0.81      0.60       918

    accuracy                           0.79      4682
   macro avg       0.71      0.80      0.73      4682
weighted avg       0.85      0.79      0.80      4682

No description has been provided for this image
In [161]:
#checking the performance of our optimised decision tree on the training data
y_test_dt_tuned_pred = dtree_estimator.predict(x_test)

metrics_score(y_test, y_test_dt_tuned_pred)

dt_tuned_metrics = model_performance_classification(dtree_estimator, x_test, y_test)
dt_tuned_metrics
              precision    recall  f1-score   support

         0.0       0.91      0.77      0.83       910
         1.0       0.47      0.73      0.57       261

    accuracy                           0.76      1171
   macro avg       0.69      0.75      0.70      1171
weighted avg       0.81      0.76      0.77      1171

No description has been provided for this image
Out[161]:
Precision Recall Recall (Defaulting Loans) Accuracy
0 0.69 0.75 0.73 0.76

Observations

  • As predicted, after hyperparameter tuning, our recall on test data in regards to defaulted loans has improved significantly: from 0.6 to 0.73.
  • Our overall model performance on training data has suffered post-hyperparameter tuning. This could be an indicator that the initial tree was overfit to the training data, and did not generalise well to unseen data. This was evidently addressed by hyperparameter optimisation.
  • The overall performance of our model has improved: however, it is still beyond the baseline of 0.8 recall reflected in the dataset. If the dataset is reflective of the bank's normal statistics, then the model is still insufficiently effective at accurately predicting who will default on their loan.

Let's have a look at the feature importances, in order to see which features the tuned tree has selected as important indicators of whether a client will default or not.

In [163]:
feature_importance_plot(dtree_estimator)
No description has been provided for this image

Observations

  • The tuned tree has selected the no. of delinq. CLs a client holds as the single most important feature to indicate whether they will default or not, followed by the age of a client's oldest credit line, their debt-to-income ratio and the number of derogatory reports on their file.
  • Total amount of mortgage due to be paid and Years spent in current role have taken a drastic fall in importance compared to the unoptimised tree.
  • The overall structure of importances seems similar to the previous, untuned tree, albeit with differences in the degrees of importances.

Building a Random Forest Classifier¶

Random Forest is a bagging algorithm where the base models are Decision Trees. Samples are taken from the training data and on each sample a decision tree makes a prediction.

The results from all the decision trees are combined together and the final prediction is made using voting or averaging.

In [166]:
#building our random forest classifier and fitting it with the training data
rf_classifier = RandomForestClassifier(class_weight = 'balanced', random_state=0)

rf_classifier.fit(x_train, y_train)
Out[166]:
RandomForestClassifier(class_weight='balanced', random_state=0)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
RandomForestClassifier(class_weight='balanced', random_state=0)
In [167]:
#checking the performance on the training data
y_pred_rf_train = rf_classifier.predict(x_train)

metrics_score(y_train, y_pred_rf_train)
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00      3764
         1.0       1.00      1.00      1.00       918

    accuracy                           1.00      4682
   macro avg       1.00      1.00      1.00      4682
weighted avg       1.00      1.00      1.00      4682

No description has been provided for this image
In [168]:
#checking the performance of the model on our training data
y_pred_rf_test = rf_classifier.predict(x_test)

metrics_score(y_test, y_pred_rf_test)
rf_metrics = model_performance_classification(rf_classifier, x_test, y_test)
rf_metrics
              precision    recall  f1-score   support

         0.0       0.90      1.00      0.95       910
         1.0       0.99      0.61      0.76       261

    accuracy                           0.91      1171
   macro avg       0.94      0.81      0.85      1171
weighted avg       0.92      0.91      0.90      1171

No description has been provided for this image
Out[168]:
Precision Recall Recall (Defaulting Loans) Accuracy
0 0.94 0.81 0.61 0.91
In [169]:
feature_importance_plot(rf_classifier)
No description has been provided for this image

Observations

  • The random forest algorithm has placed a roughly equal weight on the top 3 features it deemed to be the key deciders in this problem: the debt to income ratio, the age of the oldest credit line, and the number of delinquent credit lines a client has.
  • This is followed by the total loan size, the mortgage remaining due to be paid, and the overall number of credit lines.
  • As per the other models, the type of job role someone occupies and their reason for taking out the loan seem to provide little useful information to the model.
  • The model's macro statistics look good, with a precision of 0.94, a recall of 0.81 and an accuracy of 0.91. However, its recall for our defaulting customers is around 0.61, which means that it in nearly 2/5 cases, the model predicts that a client will not default on a loan when, in reality, they do. This is the most financially damaging to the bank, and therefore we will aim to improve upon this score.

Random Forest Classifier Hyperparameter Tuning¶

In [172]:
#building our decision tree for the purposes of optimising its hyperparameters.
rf_classifier = RandomForestClassifier(class_weight='balanced', random_state = 0)

#assigning a grid of parameters to choose from.
parameters = {
    'max_depth': [5, 7, None], 
    'n_estimators': [100, 110, 120], 
    'max_features': [0.6, 0.8, 1]
}

#designing our scoring system for GridSearchCV
scorer = make_scorer(recall_score, pos_label=1)

#running the grid search. The reason we use 10-fold cross validation here is due to the dataset being relatively small.
#the reason we do not use stratified k-fold despite the class imbalance is because we account for the imbalance in
#our estimator.
gridCV = GridSearchCV(rf_classifier, parameters, scoring = scorer, cv=10)

#fitting the grid search on the training dataset
gridCV = gridCV.fit(x_train, y_train)
#setting the classifier to look for the best combination of parameters
rf_tuned = gridCV.best_estimator_

#fitting the estimator to the data
rf_tuned.fit(x_train, y_train)
Out[172]:
RandomForestClassifier(class_weight='balanced', max_depth=5, max_features=1,
                       n_estimators=110, random_state=0)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
RandomForestClassifier(class_weight='balanced', max_depth=5, max_features=1,
                       n_estimators=110, random_state=0)
In [173]:
y_pred_rf_tuned_train = rf_tuned.predict(x_train)
metrics_score(y_train, y_pred_rf_tuned_train)
              precision    recall  f1-score   support

         0.0       0.94      0.85      0.89      3764
         1.0       0.56      0.76      0.64       918

    accuracy                           0.84      4682
   macro avg       0.75      0.81      0.77      4682
weighted avg       0.86      0.84      0.84      4682

No description has been provided for this image
In [174]:
#checking the performance of the model on our training data
y_pred_rf_test = rf_tuned.predict(x_test)

metrics_score(y_test, y_pred_rf_test)

rf_tuned_metrics = model_performance_classification(rf_tuned, x_test, y_test)
rf_tuned_metrics
              precision    recall  f1-score   support

         0.0       0.91      0.83      0.87       910
         1.0       0.54      0.70      0.61       261

    accuracy                           0.80      1171
   macro avg       0.73      0.77      0.74      1171
weighted avg       0.83      0.80      0.81      1171

No description has been provided for this image
Out[174]:
Precision Recall Recall (Defaulting Loans) Accuracy
0 0.73 0.77 0.70 0.80

Observations

  • The model has displayed a significant improvement over its predecessor. Its overall precision, recall and accuracy have decreased slightly, but it has raised its recall on defaulting customers to 0.7.
In [176]:
feature_importance_plot(rf_tuned)
No description has been provided for this image

Observations

  • The most important feature is the number of delinquent credit lines that the client holds.
  • The number of derogatory reports a client has on their credit file, their debt to income ratio and their oldest credit line age hold the next 3 orders of importance, with roughly similar weightings.

Building a Gradient Boost model¶

Gradient Boost is a boosting algorithm that effectively combines multiple weak classifiers to create a strong final classifier. It trains several weak classifiers on several subsets of data, and then combining their predictions (via voting) to create a final classifier. This will be the case for all boosting algorithms we will create going forwards.

In [179]:
gb_classifier = GradientBoostingClassifier(random_state = 1)

#fitting the classifier to our training data
gb_classifier.fit(x_train, y_train)
Out[179]:
GradientBoostingClassifier(random_state=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
GradientBoostingClassifier(random_state=1)
In [180]:
#checking the performance of the Gradient Boost model on our training data
y_pred_gb_classifier_train = gb_classifier.predict(x_train)
metrics_score(y_train, y_pred_gb_classifier_train)
              precision    recall  f1-score   support

         0.0       0.89      0.99      0.94      3764
         1.0       0.94      0.50      0.65       918

    accuracy                           0.90      4682
   macro avg       0.91      0.75      0.80      4682
weighted avg       0.90      0.90      0.88      4682

No description has been provided for this image
In [181]:
#checking the performance of the Gradient Boost model on our training data
y_pred_gb_classifier_test = gb_classifier.predict(x_test)
metrics_score(y_test, y_pred_gb_classifier_test)

gb_metrics = model_performance_classification(gb_classifier, x_test, y_test)
gb_metrics
              precision    recall  f1-score   support

         0.0       0.86      0.99      0.92       910
         1.0       0.90      0.46      0.61       261

    accuracy                           0.87      1171
   macro avg       0.88      0.72      0.77      1171
weighted avg       0.87      0.87      0.85      1171

No description has been provided for this image
Out[181]:
Precision Recall Recall (Defaulting Loans) Accuracy
0 0.88 0.72 0.46 0.87

Observations

  • The model achieves a sufficiently high overall performance on the test data. However, its recall on defaulting loans is low, equalling 0.46. This means that the model predicts less than half of customers that would have defaulted in reality to default. This could be extremely financially damaging to the bank.
In [183]:
feature_importance_plot(gb_classifier)
No description has been provided for this image

Observations

  • The model gives a significant weighting to the number of delinquent credit lines a customer holds as far as the importance of features is concerned.
  • This is followed by the debt to income ratio, the age of the oldest credit line a client holds, and the total size of the loan respectively.

Building an AdaBoost Classifier¶

In [186]:
ada_classifier = AdaBoostClassifier(random_state = 1)

#fitting the classifier to our training data
ada_classifier.fit(x_train, y_train)
Out[186]:
AdaBoostClassifier(random_state=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
AdaBoostClassifier(random_state=1)
In [187]:
#checking the performance of the AdaBoost model on our training data
y_pred_ada_classifier_train = ada_classifier.predict(x_train)
metrics_score(y_train, y_pred_ada_classifier_train)
              precision    recall  f1-score   support

         0.0       0.88      0.97      0.92      3764
         1.0       0.76      0.44      0.56       918

    accuracy                           0.86      4682
   macro avg       0.82      0.70      0.74      4682
weighted avg       0.85      0.86      0.85      4682

No description has been provided for this image
In [188]:
#checking the performance of the AdaBoost model on our test data
y_pred_ada_classifier_test = ada_classifier.predict(x_test)
metrics_score(y_test, y_pred_ada_classifier_test)

ada_metrics = model_performance_classification(ada_classifier, x_test, y_test)
ada_metrics
              precision    recall  f1-score   support

         0.0       0.86      0.96      0.91       910
         1.0       0.78      0.44      0.56       261

    accuracy                           0.85      1171
   macro avg       0.82      0.70      0.73      1171
weighted avg       0.84      0.85      0.83      1171

No description has been provided for this image
Out[188]:
Precision Recall Recall (Defaulting Loans) Accuracy
0 0.82 0.70 0.44 0.85

Observations

  • The model performs marginally better than its Gradient Boost predecessor on Precision and Accuracy metrics, but performs significantly worse on both macro and defaulting client-specific recall metrics.
In [190]:
feature_importance_plot(ada_classifier)
No description has been provided for this image

Observations

  • The AdaBoost model has gained the most information from the years that a client spent in their job, followed by their total loan value, debt-to-income ratio, and number of credit lines they hold respectively.

Building an XGBoost Model¶

In [193]:
#building and fitting our XGB classifier to the training data. We will also precise the gpu_hist tree method in order
#to make use of the gpu to help accelerate the construction of our models. GPU accelerated prediction can additionally be
#disabled if need by setting predictor = 'cpu predictor' in order to conserve GPU memory.
xgb = XGBClassifier(device = 'cuda', tree_method = 'gpu_hist', class_weight='balanced', random_state=0)
xgb.fit(x_train, y_train)
Out[193]:
XGBClassifier(base_score=None, booster=None, callbacks=None,
              class_weight='balanced', colsample_bylevel=None,
              colsample_bynode=None, colsample_bytree=None, device='cuda',
              early_stopping_rounds=None, enable_categorical=False,
              eval_metric=None, feature_types=None, gamma=None,
              grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=None, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=None, max_leaves=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              multi_strategy=None, n_estimators=None, n_jobs=None,
              num_parallel_tree=None, ...)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
XGBClassifier(base_score=None, booster=None, callbacks=None,
              class_weight='balanced', colsample_bylevel=None,
              colsample_bynode=None, colsample_bytree=None, device='cuda',
              early_stopping_rounds=None, enable_categorical=False,
              eval_metric=None, feature_types=None, gamma=None,
              grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=None, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=None, max_leaves=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              multi_strategy=None, n_estimators=None, n_jobs=None,
              num_parallel_tree=None, ...)
In [194]:
y_pred_xgb_train = xgb.predict(x_train)

metrics_score(y_train, y_pred_xgb_train)
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00      3764
         1.0       1.00      0.99      1.00       918

    accuracy                           1.00      4682
   macro avg       1.00      1.00      1.00      4682
weighted avg       1.00      1.00      1.00      4682

No description has been provided for this image
In [195]:
y_pred_xgb_test = xgb.predict(x_test)
metrics_score(y_test, y_pred_xgb_test)

xgb_metrics = model_performance_classification(xgb, x_test, y_test)
xgb_metrics
              precision    recall  f1-score   support

         0.0       0.91      0.99      0.95       910
         1.0       0.93      0.66      0.77       261

    accuracy                           0.91      1171
   macro avg       0.92      0.82      0.86      1171
weighted avg       0.91      0.91      0.91      1171

No description has been provided for this image
Out[195]:
Precision Recall Recall (Defaulting Loans) Accuracy
0 0.92 0.82 0.66 0.91

Observations

  • The model performs significantly better on the majority of stats in comparison to most of our untuned models, rivalled only by the random forest classifier in regards to precision.
  • XGB, untuned, supercedes any other untuned model in terms of recall on our defaulting customers, at 0.66. While still low, this score, combined with the overall high performance of the model, indicates that this may be the model we wish to optimise maximally in order to achieve a commercially viable product that could aid decision making in whether or not a customer is likely to default on a loan, given they are awarded one.
In [197]:
feature_importance_plot(xgb)
No description has been provided for this image

Observations

  • It appears that the number of delinquent credit lines a client holds is the single most important predictor for the model.
  • This is followed by the number of derogatory reports a client holds.
  • In stark contrast to the other models, the 3rd and 4th most important predicting features for whether a customer will default on a loan or not are whether someone's job role is in sales or self-employed, respectively. This is the first model to award this importance to a client's occupation, and is backed up by our EDA, which shows that Salespeople, for example, are much more likely to default on their loans compared to those in other occupations.

Model Optimisation Decision¶

We must now choose a model which we will optimise via hyperparameter tuning and ROC curve analysis. Two models, thus far, have been the most promising candidates for this, as their baseline performance has superceded the other models. These are: Random Forest Classifier and XGBoost Classifier. We have already optimised our random forest model, but can make further improvements, if deemed appropriate by the levels of performance seen in comparison to a tuned XGBoost Classifier.

Let's see a comparison of all the models we have built thus far in the dataframe below.

In [200]:
#creating a df with our untuned model results

models_test_comp_df = pd.concat(
    [
        lr_metrics.T,
        lda_metrics.T,
        qda_metrics.T,
        dt_metrics.T,
        dt_tuned_metrics.T,
        rf_metrics.T,
        rf_tuned_metrics.T,
        gb_metrics.T,
        ada_metrics.T,
        xgb_metrics.T
    ],
    axis = 1,
)

models_test_comp_df.columns = [
    "Linear Regression Model",
    "Linear Discriminant Analysis Model",
    "Quadratic Discriminant Analysis Model",
    "Decision Tree Classifier",
    "Tuned Decision Tree Classifier",
    "Random Forest Classifier",
    "Tuned Random Forest Classifier",
    "Gradient Boost Classifier",
    "AdaBoost Classifier",
    "XGBoost Classifier"
    ]

print("Test performance comparison:")

models_test_comp_df.T
Test performance comparison:
Out[200]:
Precision Recall Recall (Defaulting Loans) Accuracy
Linear Regression Model 0.71 0.71 0.56 0.80
Linear Discriminant Analysis Model 0.71 0.71 0.54 0.80
Quadratic Discriminant Analysis Model 0.71 0.69 0.48 0.80
Decision Tree Classifier 0.81 0.77 0.60 0.86
Tuned Decision Tree Classifier 0.69 0.75 0.73 0.76
Random Forest Classifier 0.94 0.81 0.61 0.91
Tuned Random Forest Classifier 0.73 0.77 0.70 0.80
Gradient Boost Classifier 0.88 0.72 0.46 0.87
AdaBoost Classifier 0.82 0.70 0.44 0.85
XGBoost Classifier 0.92 0.82 0.66 0.91
  • There are two key metrics we are looking for in the performance of the model: we are looking for overall accuracy, as well as a strong recall on our defaulting clients. The combination of these two parameters would allow the model to effectively assist the bank in preventing loans being lent to clients who will likely default, while still retaining a degree of accuracy that will allow the bank to serve as many viable clients as possible. The degree of accuracy, as well as an understanding of the feature selection of the model, also allows the bank to further safeguard itself legally from any potential legal challenges depending on the feature importance selection. With this understanding, the bank will be better able to justify its decision to award or not award any given client a loan.

Observations

  • The linear models, overall, performed well on the wider metrics. However, they did not perform as well as the classification models, for the most part, on our Defaulting Loans recall. This is with exception of our untuned Gradient Boost and AdaBoost models, which they surpassed in those metrics. however, they were inferior to them in overall accuracy by a significant margin.
  • Comparing the performance of the above models, as outlined, two key candidates stand out: our Tuned Random Forest model and our untuned XGBoost model. Both of them have good Recall on defaulting loans, as well as solid overall accuracy of the model. (Tuned RF Model: Recall on DL: 0.73, Accuracy: 0.76)(XGBoost Classifier: Recall on DL: 0.66, Accuracy: 0.91).

Seeing as we have optimised our Random Forest model already, let's tune the hyperparameters of our XGBoost model to see if we can get improved results.

XGBoost Hyperparameter Optimisation¶

In our XGBoost Hyperparameter optimisation, we will tune the following parameters:

  • Max Depth: the maximum depth of each tree.
  • Learning Rate: the shrinkage of step sizes to prevent overfitting.
  • No. of Estimators: number of trees built.
  • Alpha: the L1 regularisation term on weights.
  • Lambda: the L2 regularisation term on weights.
  • Column Subsampling by Tree: the subsampling ratio of columns in each tree.
  • Subsample: fraction of samples used for training each tree, helping to prevent overfitting.
  • Gamma: Additional regularisation parameter.
  • Minimum Child Weight: specifying minimum sum of instance weight needed in a child.
  • Scale positional weight: additional weight parameter which can help us with class balancing.

We will also be using randomised search in order to speed up the process, as with GridSearchCV the computational requirements would be too heavy for a single machine to complete the search in a reasonable amount of time.

In [203]:
from sklearn.model_selection import RandomizedSearchCV
In [204]:
#making a new weighted scorer, which prioritises recall on our defaulted loans, but also
#accounts for accuracy at a 9 to 1 ratio

def xgb_weighted_score(y_true, y_pred):
    recall = recall_score(y_true, y_pred, pos_label=1)
    accuracy = accuracy_score(y_true, y_pred)
    weighted_score = (3 * recall + 1 * accuracy) / 4
    return weighted_score

weighted_scorer = make_scorer(xgb_weighted_score)
In [205]:
xgb = XGBClassifier(device = 'gpu', random_state=0)

parameters_dist = {
    'max_depth': [3, 5, 7, 10],
    'learning_rate': [0.01, 0.1, 0.2, 0.3],
    'n_estimators': [50, 100, 150],
    'alpha': [0, 0.1, 1, 2],
    'lambda': [0, 0.1, 1, 2],
    'colsample_bytree': [0.3, 0.5, 0.7, 1],
    'subsample': [0.6, 0.8, 1],
    'gamma': [0, 0.1, 0.5, 1],
    'min_child_weight': [1, 3, 5],
    'scale_pos_weight': [1, 3, 5]
}

rand_search = RandomizedSearchCV(estimator=xgb, 
                                 param_distributions=parameters_dist, 
                                 n_iter=100, 
                                 scoring=weighted_scorer, 
                                 cv=10, 
                                 n_jobs=-1, 
                                 random_state=0)

rand_search.fit(x_train, y_train)

xgb_tuned = rand_search.best_estimator_

xgb_tuned.fit(x_train, y_train)
Out[205]:
XGBClassifier(alpha=0, base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=0.3, device='gpu', early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              gamma=0.1, grow_policy=None, importance_type=None,
              interaction_constraints=None, lambda=2, learning_rate=0.01,
              max_bin=None, max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=10, max_leaves=None,
              min_child_weight=3, missing=nan, monotone_constraints=None,
              multi_strategy=None, n_estimators=100, n_jobs=None, ...)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
XGBClassifier(alpha=0, base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=0.3, device='gpu', early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              gamma=0.1, grow_policy=None, importance_type=None,
              interaction_constraints=None, lambda=2, learning_rate=0.01,
              max_bin=None, max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=10, max_leaves=None,
              min_child_weight=3, missing=nan, monotone_constraints=None,
              multi_strategy=None, n_estimators=100, n_jobs=None, ...)
In [206]:
#Checking the performance of the model on the training data.
y_pred_xgb_tuned_train = xgb_tuned.predict(x_train)

metrics_score(y_train, y_pred_xgb_tuned_train)
              precision    recall  f1-score   support

         0.0       1.00      0.87      0.93      3764
         1.0       0.65      0.99      0.78       918

    accuracy                           0.89      4682
   macro avg       0.82      0.93      0.86      4682
weighted avg       0.93      0.89      0.90      4682

No description has been provided for this image
In [207]:
#Checking the performance of the model on the test data.
y_pred_xgb_tuned_test = xgb_tuned.predict(x_test)

metrics_score(y_test, y_pred_xgb_tuned_test)

xgb_tuned_metrics = model_performance_classification(xgb_tuned, x_test, y_test)
xgb_tuned_metrics
              precision    recall  f1-score   support

         0.0       0.95      0.79      0.87       910
         1.0       0.54      0.86      0.67       261

    accuracy                           0.81      1171
   macro avg       0.75      0.83      0.77      1171
weighted avg       0.86      0.81      0.82      1171

No description has been provided for this image
Out[207]:
Precision Recall Recall (Defaulting Loans) Accuracy
0 0.75 0.83 0.86 0.81

Observations

  • As we can see, after hyperparameter tuning, this model has shown us the best results, by our metrics, seen to date thus far. It has a recall on defaulted loans of 0.86, which means that it will misclassify just over 1 in 10 of those who would default on a loan as 'safe' clients. It additionally has a macro accuracy of 81, which is not as good as most of our other tree-based models, but is a tradeoff we are able to make in this business case. A small loss in accuracy compared to other models is a result of the model's focus on accurate recall of our target class, a.k.a the focus on proving our alternate hypothesis.
  • This gain in recall does come at a cost: the model is still prone to classifying those that, in fact, would not have defaulted on their loan in reality into the default-prone class. In fact, the model is prone to this misclassification in just under half of its predictions. What this means is that the model is effectively more sensetive to parameters that indicate somebody could default on their loan, and, in just under 50% of cases, is too sensitive, misclassifying potential 'safe' clients.
  • Out of the overall dataset, the model correctly classifies 81% of clients, and 20% of the dataset overall is misclassified.
  • Thus far, this model has been the best-performing model we have been able to create for the purposes of this study, and is the only one we could viably recommend to be implemented for commercial use. We will go further into the justification on why this loss in accuracy is worth the gain in recall further into the conclusion. There are also some precisions I would like to make regarding the use case of this model, but as things stand, let's have a deeper look at the feature importances of this model, to get an estimate of its rationale in classifying clients into either a default-prone or non-default-prone class.
In [209]:
feature_importance_plot(xgb_tuned)
No description has been provided for this image

Observations

  • It appears that this model has selected the number of delinquent reports as its key indicator for whether an applicant will default on a loan.
  • The second most important factor appears to be the number of derogatory reports that they have on their credit file.
  • This is followed by, respectively, whether an applicant works in an office job or not, and their debt-to-income ratio.
  • It appears that this dataset has given almost every feature at least a degree of importance in its decision making process.

In order to best understand the decision-making process of this model and as per our problem statement, we can use GraphViz to visualise the best estimator in the model. While this does not give us insight into the construction of the model, we do get an understanding of how it uses feature importances in order to classify loan applicants.

In [211]:
#pip install graphviz
import os
from xgboost import plot_tree
from graphviz import Source
In [212]:
# Function created with the aid of ChatGPT

# Function to visualize a tree with graphviz, allowing control over the figure size
def plot_tree_with_graphviz(booster, num_tree=0, max_depth=3):
    # Get the tree dump in DOT format
    tree_dump = booster.get_dump(dump_format='dot', with_stats=True)[num_tree]
    
    # Insert graph attributes to control size and direction
    tree_dump = 'digraph {\nsize="20,20";\nrankdir=UT;\n' + tree_dump[tree_dump.find('\n')+1:]
    
    # Create a Source object from the DOT format tree
    tree_graph = Source(tree_dump)
    
    # Render the tree graph to view it
    tree_graph.view()

# Example usage with xgb_tuned
num_trees = xgb_tuned.get_booster().num_boosted_rounds()
plot_tree_with_graphviz(xgb_tuned.get_booster(), num_tree=num_trees-1, max_depth=3)

Output sample image from PDF: image.png We can also display the entire chart using HTML embedding, as seen below. (IMPORTANT NOTE: In order to see the full interactive tree chart, the code above must be executed, and the PDF file will be generated on the machine locally within the same folder as the notebook.)

In [214]:
#Adapted from Batfan's answer on stackoverflow: 
#https://stackoverflow.com/questions/291813/recommended-way-to-embed-pdf-in-html

from IPython.display import HTML

file_path = 'Source.gv.pdf'

# Display PDF with HTML
HTML(f'<embed src="{file_path}" width="800" height="600" type="application/pdf">')
Out[214]:

References¶

  • OpenAI. (2023). ChatGPT (Aug 06 version) [Large language model]. https://chat.openai.com/chat
  • Answer by Batfan (2024). Recommended way to embed PDF in HTML? [online] Stack Overflow. Available at: https://stackoverflow.com/questions/291813/recommended-way-to-embed-pdf-in-html [Accessed 7 Aug. 2024].

Conclusion¶

Comparison of techniques, and their relative performance based on chosen Metric¶

1. Comparison of various techniques and their relative performance based on chosen Metric (Measure of success):

  • Our chosen performance metric for our models, for the purposes of this study, is comprised of a combination of the ability to single out and accurately predict clients who would default on an awarded loan, whilst still retaining overall ability to tell apart customers who would be able to pay off their loan. Therefore, the metric we will be choosing is a combination of recall on defaulting loans and overall model accuracy.
  • The recall on defaulting loans here is the priority, as it allows the bank to filter out customers who would default on their loans if their application was approved. This is key, as loan defaults, if allowed to accumulate, cause significant financial damage to the banking institution. The overall accuracy, while secondary, is still important, as it is needed to ensure that the bank's application process does not filter out a significant portion of applicants who would, in reality, be able to pay off their loan.
  • As seen across our models, tree-based classification models have performed better than linear models overall. Whilst linear models had decent overall accuracy, they had struggled to capture a the information that would help them with recall on out defaulting loan applicants by comparison.
  • Within our tree-based models, two models surpassed the rest on untuned results as per our metrics: Random Forest Classifier and XGBoost Classifier.
  • Once tuned, the XGBoost Classifier showed us better results when it came to both the ability to maximise our defaulting applicant recall, as well as overall accuracy of the model. Whilst both are ensemble learning algorithms, Random Forest is a bagging algorithm, which does not learn from its previous trees built. XGBoost, meanwhile, as a boosting algorithm, seems to have captured an additional degree of information from all of the data provided to it, and has performed significantly better than the rest of the models.
  • The tuned XGBoost model also performed better than, by what we can surmise, the bank's standard practices, delivering a higher level of recall on defaulting loans and accuracy than what we observe in the dataset itself.
  • Additionally, the tree-based models offer much more explainability to their decision making process than the linear models do, giving them another advantage. The tree-based models, whether singular or ensemble, offer ways to look into the decision-making process of their best estimators (best trees). In our final tuned XGBoost model, we can see an example of this: the model allows us a look at its decision making process. As per the Equal Credit Opportunity Act (ECOA), the bank is thus more easily able to justify its rationale behind not awarding a loan to an individual, all whilst maintaining a higher accuracy of decision making than standard, human discretionary practices provide.

2. Refined insights:

  • It is key to go beyond assessing someone's job title in order to figure out the impact their current employment may have on their ability to pay off a loan. Crucially, it is important to understand why someone's employment may have an effect on this. Roles in which total income is highly performance-based are more likely to default on a loan than those where the income is steady, as a single bad year may impact the ability to pay off a loan. This can be seen in the EDA, where salespeople and self-employed individuals have been shown to be more likely to default on a loan, proportionally, than other categories.
  • There is a clear correlation between an individual's credit history and their likelihood to pay off a loan. As proven both in the EDA and in our models, invividuals with a more spotted credit file are more likely to default on a loan than those who have consistently shown themselves to be able to repay loans on time. This can be seen in our models, as the more robust models prioritise the number of delinquent credit lines and the number of derogatory reports an individual holds as a reliable indicator of whether someone will pay off a loan.
  • If the dataset is reflective of the bank's current practices, this means that the established procedures for loan application reviews could be improved upon significantly. 20% of the loans lent to clients have been defaulted on, meaning that the bank has sustained circa USD22,179,544 in damages from the defaulted loans within this dataset alone. It is likely that over the course of a year/several years, this accumulates and is able to cause significant damage to the bank's business model.
  • Data gathering, for the purposes of similar studies, needs to be improved if the bank/banks wish to improve their overall decision-making process. Specifically, as mentioned before, the job that an applicant holds may have a significant impact on their ability to pay off a loan. Therefore, it would be valuable to garner a further understanding of the 'other' job category. It would be useful to unpack this category, potentially through a process of individual inputs for every applicant, rather than lumping the vast majority of job roles into a single category. It should be noted that if this was to be done, this would make the process of data cleaning and preprocessing more challenging. However, this can be simplified significantly via the use of large language models, the integration of which with data science is only accelerating. This would allow us to derive much more information from the feature, and to build more accurate profiles of applicants who would be likely/unlikely to default on a loan.
  • It would additionally be useful to understand an individual's annual income, as well as their rough annual expenditures. As we saw, in some cases, an individual's debt-to-income level serves as a relatively strong predictor for their ability to pay off a loan. This data would allow us further insight into their financial situation, and therefore their ability to pay off any given loan.

3. Proposal for the final solution design:

  • For the purposes of predicting whether an applicant will ultimately default on a loan or not, I would recommend the use of our Tuned XGBoost final model as an aid to the human decision-making process. The exact use case I would recommend for the model is the following:

  • The model has a strong level of recall on defaulting customers, as well as a solid overall level of accuracy. Its recall on defaulting customers is superior to the bank's standard practices as it stands, as per the ratio of defaulted to paid off loans in the dataset. Therefore, I would recommend using the model as an initial sorter. If the model predicts that a customer is unlikely to default on a loan, there is a significant chance that they would be a safe customer. Therefore, these applicants should be subject to reduced checks, saving the application team significant amounts of time and allowing for reduced bureaucratic bottlenecks in the loan application process for clearly eligible applicants.

  • Applicants who the model says are likely to default should, for now, continue to undergo further checks. The reason for this is the model's eagerness, or bias, to predict that a customer may default on the loan. With a precision of 0.54 on defaulted loans, almost half of all applicants deemed likely to default on a loan will, in fact, be customers who would otherwise be able to pay their loan. A human eye on every application falling into this category would ensure that the bank misses our on a minimal amount of 'safe' customers, whilst simeltaneously aiding the team in identifying risky customers.

  • Therefore, as it stands, the model is a significant aid for the team rather than a complete, automated solution. However, if implemented, it could simeltaneously decrease the proportion of defaulted loans by almost a third on its own, whilst simeltaneously reducing the workload for the applications processing team by as much as nearly 75% (basic checks on predicted 'safe' applicants notwithstanding).

  • In order to further improve the model, two particular steps could be done. Each have their own implementation, their own upsides and their own drawbacks.

  1. The data collection process could be further improved, as outlined above. Whilst this may make the process of application more complicated initially, and prove more of a challenge for the data science team, it would ultimately have a strong impact in a model's ability to accurately predict whether the customer would default on a loan or not, as well as provide more details regarding every applicant.
  2. The model itself can be further upgraded in one of two ways: its hyperparameters could be further tuned using GridSearchCV, rather than RandomisedSearchCV. However, this would come at a significant computational cost. Having tried to train the model via this method myself, this is not viable within a reasonable timeframe without commercial use of GPUs that are dedicated to the purposes of training machine learning models. The second method would be to build a "super-learner" model, or a model that effectively uses multiple ensemble learning models to create a super-ensemble model, the output of which would use the results of the previous models. The challenge with building such a model, however, is two-fold. Firstly, it, as discussed previously, is not viable without hardware that is specifically designed to train machine learning models. Secondly, and more crucially for a large banking institution: it would incur a loss in explainability despite better performance. Therefore, it would be more difficult for the bank to apply this model under the legal framework of ECOA, as justification behind rejections would be more difficult to rationalise.

References¶

  • Sultan, A. HR Employee Attrition Prediction, MIT - Great Learning. Available at: https://olympus.mygreatlearning.com/courses/102279/files/10506888?module_item_id=5884087 (Accessed: 04 Aug. 2024).
  • Sayah, F. Logistic Regression for Binary Classification Task, Kaggle. Available at: https://www.kaggle.com/code/faressayah/logistic-regression-for-binary-classification-task (Accessed: 05 August 2024).
  • OpenAI. (2023). ChatGPT (Aug 06 version) [Large language model]. https://chat.openai.com/chat (Accessed: 06 Aug. 2024)
  • Answer by Batfan (2024). Recommended way to embed PDF in HTML? [online] Stack Overflow. Available at: https://stackoverflow.com/questions/291813/recommended-way-to-embed-pdf-in-html (Accessed 7 Aug. 2024).